Abstract
We develop a novel global perspective of the complexity of the relationships between three COVID-19 datasets: the standardised per-capita growth rates of COVID-19 cases and deaths, and the Oxford Coronavirus Government Response Tracker COVID-19 Stringency Index (CSI), a measure describing the stringency of a country’s lockdown policies. We use a state-of-the-art heterogeneous intrinsic dimension estimator implemented as a Bayesian mixture model, called Hidalgo. Our findings suggest that these highly popular COVID-19 statistics may project onto two low-dimensional manifolds without significant information loss, suggesting that COVID-19 data dynamics are generated from a latent mechanism characterised by a few important variables. The low dimensionality implies a strong dependency among the standardised growth rates of cases and deaths per capita and the CSI for countries over 2020–2021. Importantly, we identify spatial autocorrelation in the intrinsic dimension distribution worldwide. The results show that high-income countries are more prone to lie on low-dimensional manifolds, likely arising from aging populations, comorbidities, and increased per capita mortality burden from COVID-19. Finally, the temporal stratification of the dataset allows the examination of the intrinsic dimension at a more granular level throughout the pandemic.
Introduction
High-dimensional datasets are generally challenging for statistical inference and data analysis. Their analysis is made even more challenging by longitudinal measurements and temporal autocorrelation. Fortunately, these kinds of data typically have a high degree of redundancy and, therefore, may project onto low-dimensional manifolds without losing substantive information^{1,2}. The dimensionality of these manifolds is called the intrinsic dimension (ID) of the data, and it can provide important information about the properties of datasets.
Data science methods for high-dimensional datasets have been utilised and explored in multiple contexts to aid decision-making and analysis during the COVID-19 pandemic. For example, city-wide smart card travel data has been utilised in Sydney, Australia, to cluster passenger types along multiple mobility dimensions and develop intervention strategies for disease spread^{3}. Similarly, manifold learning techniques have been applied to cellphone mobility data in the United States during the COVID-19 pandemic, distinguishing mobility trends in multiple geographic regions and demographics^{4}. Others have leveraged dimensionality reduction techniques to cluster and analyse high-dimensional genome sequence data of COVID-19^{5} and identified essential features in predicting the mode of instruction in American universities during the COVID-19 pandemic^{6}.
Additionally, these statistical techniques have enabled decision-makers to parse the large body of communication transmitted online during the COVID-19 pandemic to glean new insights. Uniform manifold approximation and projection and latent Dirichlet allocation have been used to parse Twitter data during the COVID-19 pandemic to distinguish topics and identify trends and patterns in social network behaviours^{7,8}. In another example, Doanvo et al.^{9} utilise similar techniques to analyse a large body of open-access COVID-19 research studies and classify research output to identify existing knowledge gaps in research.
However, there has been little work done to explore the latent dynamics of the pandemic spread across continents and countries. Sivakumar and Deepthi^{10} examined the temporal dynamics of COVID-19 daily cases and deaths in 40 countries, using a False Nearest Neighbour method to identify the relevant embedding dimension (ED) for each country. The authors recognise that new COVID-19 cases and deaths exhibit a low to medium-level ED. However, it is essential to note that ED does not account for points in a dataset lying on low-dimensional manifolds. Thus, identifying the ID is generally more valuable as it accounts for inherent structures in the data and remains a more accurate representation of underlying structural complexity in a dataset^{11,12}. This work seeks to bridge this gap by providing information on the complexity and dimensionality of the COVID-19 pandemic in different countries, and by developing a deeper understanding of the spread of the pandemic.
This paper provides an application of the recent heterogeneous ID algorithm (Hidalgo). Hidalgo is a Bayesian mixture model capable of clustering the observations into groups characterised by similar IDs. The ID can be considered an indicator of the complexity of the data: the higher its value, the larger the number of relevant directions required to represent the data points faithfully. More information about ID may be found in the next section, and more formal definitions of ID can be found elsewhere^{2,13}.
The vast majority of statistical methods assume and estimate a unique value for the ID. However, this assumption is often too strong for datasets containing information generated by intricate systems with complex dynamics, such as a global pandemic. Hidalgo extends this framework, allowing the presence of multiple manifolds characterised by different ID values in the same dataset. The Bayesian local ID estimator has been applied successfully to a diverse range of datasets for scenarios such as financial markets, neuroimaging, proteomics^{14}, genomics^{15}, and high-resolution player tracking data^{11}. Here, we seek to organise the pandemic dynamics of different countries into groups with similar ID to help us unveil complex patterns related to the dynamics of the COVID-19 pandemic. Finally, we temporally stratify the dataset to examine the ID at a more granular level throughout the COVID-19 pandemic.
Methods
Likelihood-based intrinsic dimension estimation
A large number of ID estimators are currently available in the literature. Likelihood-based estimators are particularly appealing because of their theoretical foundation and the immediate ability to provide estimates for uncertainty quantification.
Recently, building on previous work^{1,16}, the ‘Two Nearest Neighbours’ (TWONN) estimator was introduced^{17}, based on the following distributional result. Assume we observe n units in a dataset of nominal dimension D (intuitively, the number of recorded columns in a tall dataset), where the data lie on a manifold of smaller dimension d, the ID. In other words, some dimensions are irrelevant, or there may be a functional relationship between two or more coordinates. From a modelling perspective, we consider the dataset as a configuration obtained from a Poisson point process over \(\mathbb {R}^D\) characterised by a homogeneous intensity function \(\rho \). In that case, one can prove that the ratio of the distances between a given point and its second and first nearest neighbours (NN) is Pareto distributed with shape parameter d and scale parameter identically equal to 1. Algebraically, denoting with \(r_{i,j}\) the distance between the ith point and its jth NN, we have:
\[ \mu _i = \frac{r_{i,2}}{r_{i,1}} \sim Pareto(1, d), \qquad \mu _i \in (1, +\infty ). \qquad (1) \]
Although the theoretical derivation requires a uniform intensity of the point process, the result in Eq. (1) is empirically valid as long as the homogeneity assumption holds up to the second NN for every point.
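The distributional result in Eq. (1) can be illustrated with a minimal sketch. It assumes the maximum-likelihood form of the estimator, \(\hat{d} = n / \sum _i \log \mu _i\), which follows from \(\log \mu _i\) being exponentially distributed with rate d; the cited works may use a different fitting procedure. The function name `twonn_id` and the toy data are hypothetical and are not taken from the study.

```python
import math
import random


def twonn_id(points):
    """Maximum-likelihood ID estimate from ratios of 2nd to 1st NN distances."""
    n = len(points)
    log_mu_sum = 0.0
    for i, p in enumerate(points):
        # Brute-force distances from point i to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        log_mu_sum += math.log(r2 / r1)
    # If mu_i ~ Pareto(1, d), the MLE of d is n / sum(log mu_i).
    return n / log_mu_sum


random.seed(0)
# Toy data (hypothetical): a 2-dimensional plane linearly embedded in
# D = 5 nominal dimensions, so three coordinates are redundant.
plane = []
for _ in range(400):
    u, v = random.random(), random.random()
    plane.append((u, v, u + v, u - v, 2.0 * u))

d_hat = twonn_id(plane)
print(f"nominal dimension: 5, estimated ID: {d_hat:.2f}")
```

Because only the two nearest neighbours of each point enter the estimate, the nominal dimension of 5 does not inflate it; the estimate should land near the true value of 2, up to sampling noise.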
As previously mentioned, methods that return a unique ID value to describe the entire dataset can often be limiting and unrealistic since data may lie on multiple latent manifolds with different IDs. This shortcoming has been addressed^{14} by partitioning the data into subgroups characterised by locally homogeneous ID via a Bayesian mixture model^{18}. We now suppose that the ratios \(\mu _{i}\), for \(i=1,\ldots ,n\), are potentially generated from L different Pareto distributions, obtaining:
\[ f\left( \mu _i \mid \varvec{\pi }, \varvec{d}\right) = \sum _{l=1}^{L} \pi _l \, d_l \, \mu _i^{-(d_l+1)}, \qquad \mu _i \in (1, +\infty ), \qquad (2) \]
where \(\varvec{\pi }=\left( \pi _1,\ldots ,\pi _L\right) \) is the vector of mixture weights and \(\varvec{d}=\left( d_1,\ldots ,d_L\right) \) is a vector containing the ID parameters. The Bayesian model is completed with prior distribution specifications. In particular, the authors chose identically distributed and independent conjugate Gamma priors for each element of \(\varvec{d}\), with shape and rate parameters \(a_d>0\) and \(b_d>0\), respectively: \(d_l\sim Gamma(a_{d},b_{d})\,\,\forall \,\,l\). Moreover, a Dirichlet prior for the mixture weights is adopted: \(\varvec{\pi }\sim Dirichlet(\alpha _1,\ldots , \alpha _L)\), where \(\varvec{\alpha }=(\alpha _1,\ldots , \alpha _L)\) is a vector of positive concentration parameters.
Here, we adopt a sparse mixture specification^{19,20} which permits, similarly to a nonparametric approach, a distinction between the number of fitted components L and the number of estimated clusters, \(L^*\), which coincide with the populated components. To this aim, a careful choice of the vector \(\varvec{\alpha }\) (e.g., by setting all its entries to small values, say \(\alpha _l \le 0.05,\,\,\forall l\)) allows the method to automatically select the necessary number of mixture components \(L^*\le L\), preventing the need to fit multiple models with different values of L and then rely on post-hoc solutions, such as the comparison of information criteria (e.g., AIC, BIC) or marginal likelihood to select the best model. Indeed, within this context, the value L in Eq. (2) is interpreted as an upper bound on the number of populated clusters, and the actual number of manifolds is directly estimated from the data.
As customary in Bayesian mixture models, we can augment the parameter space to enhance inference and ease posterior computation by adding the auxiliary parameters \(c_i\in \{1,\ldots ,L\}\), for \(i=1,\ldots ,n\).
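The effect of small concentration parameters can be seen in a short simulation from the prior alone. This is a sketch under stated assumptions, not part of the Hidalgo implementation: it draws mixture weights from a symmetric Dirichlet, assigns n units to components, and counts how many components are populated. The comparison value \(\alpha _l = 5\) and the helper names are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]


def populated_components(weights, n, rng):
    """Assign n units to components with the given weights; count non-empty ones."""
    labels = rng.choices(range(len(weights)), weights=weights, k=n)
    return len(set(labels))


rng = random.Random(1)
L, n = 6, 115  # upper bound on components and number of countries, as in the text

# Sparse prior (alpha_l = 0.05) versus a non-sparse comparison (alpha_l = 5, assumed).
sparse = [populated_components(sample_dirichlet([0.05] * L, rng), n, rng)
          for _ in range(200)]
flat = [populated_components(sample_dirichlet([5.0] * L, rng), n, rng)
        for _ in range(200)]

mean_sparse = sum(sparse) / len(sparse)
mean_flat = sum(flat) / len(flat)
print(f"mean populated components: sparse={mean_sparse:.2f}, flat={mean_flat:.2f}")
```

Under the sparse prior most of the mass concentrates on very few components, so \(L^*\) is typically well below L before any data are seen; the posterior number of populated components is then determined by the data.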
These latent membership labels link each observation to a cluster. In other words, \(c_i=l\) implies that the ith unit is assigned to the lth mixture component. Unfortunately, even given this expansion to the model space, fitting the model presented in Eq. (2) is exceptionally challenging: the overlapping supports of the Pareto distributions jeopardise the clustering assignment, which in turn prevents the derivation of reliable estimates of the ID. To address this issue, the authors of Hidalgo^{14} enhanced the model by introducing a local homogeneity assumption, postulating that data points close to each other are more likely to lie on the same latent manifold and, therefore, should be clustered together. This way, the clustering is aided by spatial information about the data points, which was previously ignored. In particular, the authors make use of the \(n\times n\) binary similarity matrix \(\mathscr {N}^{(q)}\), with a generic entry defined as \(\mathscr {N}^{(q)}_{ij} = 1\) if the jth observation is among the first q NNs of the ith observation, and 0 otherwise. To enforce local homogeneity, \(\mathbb {P}\left[ \mathscr {N}^{(q)}_{ij} = 1\mid c_i = c_j\right] = \zeta \) and \(\mathbb {P}\left[ \mathscr {N}^{(q)}_{ij} = 1\mid c_i \ne c_j\right] = 1-\zeta \). This model extension leads to the following specification:
where \(\mathscr {Z}_i\) is a normalizing constant and \(Cat_{L}\) denotes a Categorical distribution over the set \(\{1,\ldots ,L\}\). A closed-form expression for the posterior distribution is not available, so we rely on Markov Chain Monte Carlo (MCMC) techniques to simulate a posterior sample. The interested reader can find more technical discussions of this model specification and the validity of the underlying hypothesis in the Supplementary Material of related papers^{14,15}. In these references, one can also find more details about the Gibbs sampler algorithm used for fitting the model and the post-processing tools adopted to deal with computational issues such as label-switching. In this work, we apply the model defined by Eq. (2) and the corresponding Hidalgo algorithm^{14} to assess global COVID-19 disease dynamics. More details are provided in the following subsection.
Data description
This work utilises three datasets to explore the disease and spreading dynamics of COVID-19 in countries: new COVID-19 cases per million population (pmp), new COVID-19 deaths pmp^{21}, and the COVID-19 Stringency Index from the Oxford Coronavirus Government Response Tracker (OxCGRT)^{22} (hereafter referred to as CSI). The CSI describes the stringency of government measures by recording the number of government policies in each country and their strictness. The index is a composite measure based on nine response indicators, including school and workplace closures, travel bans, etc. These indicators are rescaled to a value from 0 to 100 (100 = strictest response). Together, these three datasets capture the health and social effects of COVID-19 on each country. The CSI has informed studies in the health sciences, such as estimating the impact of various physical distancing measures on disease incidence^{23} and relating different levels of healthcare resources to the associated transmission risk^{24,25}. Political scientists have employed the CSI to consider whether stringency measures vary by regime type^{26,27}, and whether upcoming elections influenced the strength of responses^{28}.
We source the datasets from the Our World in Data ‘Data Explorer’, which formats and aggregates a variety of datasets from academic and public institutions globally^{21}. Our World in Data sources data on worldwide COVID-19 cases and deaths from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University^{29}. Given the open nature of these datasets, no ethical approval or specific permissions were required for this study.
Each row of the final dataset contains a country index and the corresponding concatenated time series (cases, deaths, and stringency index). Figure 1 provides an excerpt of the combined dataset.
The dataset spans a period of 454 days from \(1^{\text {st}}\) Mar 2020 to \(29^{\text {th}}\) May 2021 and initially included 190 countries. Given that this analysis includes three datasets each containing 454 temporal measurements, the nominal dimension of the dataset is \(\textit{D} = 454 \times 3 = 1362\).
To improve the robustness of the study, we included countries only if they met certain data availability and size requirements. Countries with more than 20% of missing data in the given period for any of the three time series were excluded. Any remaining missing values were imputed with a linear regression using the imputeTS package in R^{30,31}. Additionally, countries with smaller populations (\(< 1\) million) can display higher volatility in new cases and deaths pmp. Thus, such countries were excluded from the dataset to limit the influence of outliers in the analysis. These data pre-processing procedures leave \(n=115\) countries.
All data manipulation and transformation tasks performed as part of the pre-processing methodology were undertaken using the tools available in the tidyverse package in R^{31,32}. The pre-processed dataset and the corresponding code to replicate the data preparation methodology, results, and figures are available on GitHub.
Additionally, the original dataset may be temporally stratified into four consecutive pandemic stages of equal length to reveal the ID at a more granular level. Table 1 documents the date range of each stage.
First, Hidalgo requires unique observations in a dataset^{14} and thus performs optimally on datasets with continuous data or discrete numbers within a broad range. Therefore, this analysis scales new cases and deaths by the country population to satisfy this assumption and enable disease dynamics to be compared across countries with different populations. Second, Hidalgo assumes independent and identically distributed observations in a dataset. To limit temporal autocorrelation, two pre-processing steps are applied to the chosen datasets. Firstly, ‘new’ cases and deaths pmp are selected, as opposed to their ‘active’ or ‘cumulative’ counterparts. Additionally, each of the three datasets is normalised to z-scores across all countries, given by Eq. (4):
\[ z_{k} = \frac{x_{k} - \bar{x}_{k}}{s_{k}}, \qquad (4) \]
where x represents a dataset, \(k \in \{1,2,3\}\) indexes the three datasets used in this analysis, and \(\bar{x}_{k}\) and \(s_{k}\) represent the mean and standard deviation of dataset k, respectively.
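The standardisation step can be sketched as follows. The sketch assumes that the mean and standard deviation of each dataset are pooled over all countries and all days, and that the sample standard deviation is used; neither detail is stated explicitly in the text. The country codes and values are hypothetical toy data.

```python
from statistics import mean, stdev


def zscore_dataset(series_by_country):
    """Standardise one dataset using the mean and SD pooled over all countries."""
    pooled = [v for series in series_by_country.values() for v in series]
    x_bar, s = mean(pooled), stdev(pooled)  # sample SD is an assumption
    return {c: [(v - x_bar) / s for v in series]
            for c, series in series_by_country.items()}


# Hypothetical toy values for new cases pmp over four days in two countries.
new_cases_pmp = {"AAA": [1.0, 3.0, 5.0, 7.0], "BBB": [2.0, 4.0, 6.0, 8.0]}
z_cases = zscore_dataset(new_cases_pmp)

# After standardisation the pooled values have mean 0 and SD 1.
pooled_z = [v for series in z_cases.values() for v in series]
print(round(mean(pooled_z), 10), round(stdev(pooled_z), 10))
```

Applying the same function to each of the three datasets and concatenating the results per country would give one row of length \(D = 454 \times 3\) per country.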
Computational details
Hidalgo was run on this dataset for 25,000 MCMC iterations, after a burn-in of 1,000. The fast convergence was confirmed by running a secondary analysis with 10,000 iterations and obtaining the same results. A sparse mixture modelling approach^{19,20} is employed in this analysis, with \(L = 6\) mixture components, and \(\alpha = 0.05\) for the Dirichlet priors of the mixture weights. Three matrices are produced as the output^{15}:

1. Membership labels (dim: \(nsim \times n\)) where each column contains the MCMC sample of the membership labels for every observation;
2. Cluster probabilities (dim: \(nsim \times L\)) where each column contains the MCMC sample of the mixing weights for each mixture component;
3. Intrinsic dimensions (dim: \(nsim \times L\)) where each column contains an MCMC sample for every ID parameter estimated in each cluster.
The MCMC chains produced by Hidalgo may exhibit label-switching issues, which prevent direct extraction of inference from the MCMC output. Indeed, label-switching arises whenever a mixture model with a priori symmetric components is adopted. Due to label-switching, mixture components can be discarded, emptied, or repopulated across iterations.
To obtain a reliable clustering estimate, one can inspect the posterior co-clustering matrix \(PCM=\{p_{ij}\}\) computed across the n countries, where each entry \(p_{ij}\) is defined as the proportion of times that countries i and j have been clustered together across the nsim MCMC iterations. Once the PCM is estimated, one can recover a clustering estimate by minimising a loss function over the space of the possible partitions. A widely used method is the minimisation of the Variation of Information (VI) loss function^{33,34}. This way, we can estimate the number of latent ID manifolds in the dataset. Moreover, we can also obtain more specific results by following a post-processing procedure^{15}, devised to address the label-switching issue. The algorithm maps the L different parameter-specific chains – one for each mixture component parameter \(\{d_l\}_{l=1}^L\) – to n observation-specific chains \(\{d_{c_i}\}_{i=1}^n\). This way, not only are we able to draw inferences about the clusters characterised by heterogeneous ID present in the data, but we can also focus on the observation-specific ID estimates. Thus, in our application, we can compare the different country-specific ID estimates in addition to ID estimates of latent manifolds in the dataset.
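A minimal sketch of the posterior co-clustering matrix illustrates why it is unaffected by label-switching: each entry depends only on whether two units share a label within a draw, not on which label it is. The function name and the toy membership labels below are hypothetical; the subsequent minimisation of the VI loss is not shown.

```python
def posterior_coclustering(labels):
    """Return the n x n PCM from nsim draws of n membership labels each."""
    nsim, n = len(labels), len(labels[0])
    counts = [[0] * n for _ in range(n)]
    for draw in labels:
        for i in range(n):
            for j in range(n):
                if draw[i] == draw[j]:
                    counts[i][j] += 1
    # p_ij = proportion of draws in which units i and j share a cluster.
    return [[c / nsim for c in row] for row in counts]


# Toy MCMC output (hypothetical): 4 draws, 3 units. The first two draws
# encode the same partition with the labels swapped.
draws = [
    [1, 1, 2],
    [2, 2, 1],
    [1, 1, 2],
    [1, 2, 2],
]
pcm = posterior_coclustering(draws)
for row in pcm:
    print(row)
```

Units 0 and 1 share a cluster in three of the four draws, so \(p_{01} = 0.75\), even though their common label changes between the first and second draws.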
Results and discussion
Global COVID-19 data is characterised by low complexity
A summary of the ID analysis of global data is shown in Fig. 2. In particular, Fig. 2E highlights the posterior distribution of the IDs in each cluster, from which we can obtain a visual estimate of the variability of the ID estimates. Hidalgo automatically identifies two manifolds (\(L^*=2\)) with posterior mean IDs \(d_1=12\) and \(d_2=9\), indicating that the COVID-19 disease dynamics and corresponding government-established non-pharmaceutical interventions (NPIs) display higher redundancy in some countries than in others. A higher ID indicates more complex dynamics, as Hidalgo identifies that these countries project onto a higher-dimensional manifold. Conversely, a lower ID suggests simpler dynamics, as these countries project onto a lower-dimensional manifold.
Given the high nominal dimension of the dataset (\(D = 1362\)), IDs of 12 and 9 represent a dimensionality reduction of 99.11% and 99.34%, respectively, suggesting strong dependence among the standardised new cases per million population (pmp), new deaths pmp, and the CSI for a country over the given period. Notably, these results indicate that a small set of parameters governs the COVID-19 dynamics, which has important implications for practitioners seeking to model these dynamics or apply dimensionality reduction techniques. For example, previous work^{35} has identified that lower IDs lower the sample complexity of learning, enabling more accessible learning for neural networks and better model generalisation from training to test data.
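The reduction percentages follow from comparing each ID with the nominal dimension \(D = 454 \times 3 = 1362\). The short calculation below uses the rounded posterior means quoted in the text, so its output may differ in the last digit from figures computed on unrounded estimates; the function name is hypothetical.

```python
def reduction_percent(intrinsic_dim, nominal_dim):
    """Percentage of nominal dimensions made redundant by the ID."""
    return 100.0 * (1.0 - intrinsic_dim / nominal_dim)


D = 454 * 3  # 454 days for each of the three time series
reduction_high = reduction_percent(12, D)  # manifold with ID 12
reduction_low = reduction_percent(9, D)    # manifold with ID 9
print(f"ID 12: {reduction_high:.2f}%  ID 9: {reduction_low:.2f}%")
# prints "ID 12: 99.12%  ID 9: 99.34%"
```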
Despite the overall low dimensionality of the dataset, the two ID manifolds identified differ by at least three dimensions. This result warrants further inspection to examine potential explanations for the dimensionality of each ID manifold.
Global distribution of COVID-19 data complexity demonstrates spatial autocorrelation
Notably, we identify evidence of spatial autocorrelation in the ID of global COVID-19 data, supported by Fig. 2D,F. Figure 2D highlights the individual ID of each country included in the dataset. The colour of each country code corresponds to the ID manifold to which each country belongs. The ID manifold of each country may also be presented geographically on a map, as displayed in Fig. 2F. Upon visual examination, Fig. 2D,F demonstrate that countries geographically close together tend to belong to the same ID manifold.
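We confirmed these results using Moran’s I test, a widely used spatial statistic for detecting spatial autocorrelation^{36,37}. Moran’s I ranges from −1 to 1, and is defined as:
\[ I = \frac{N}{W} \, \frac{\sum _{i=1}^{N}\sum _{j=1}^{N} w_{ij}\left( x_i-\bar{x}\right) \left( x_j-\bar{x}\right) }{\sum _{i=1}^{N}\left( x_i-\bar{x}\right) ^2} \]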
where N is the number of spatial units indexed by i and j; \(x_i\) is the individual median posterior ID estimate of country i; \(\bar{x}\) is the mean of x; \(w_{ij}\) are the entries of a neighbour adjacency matrix with zeroes on the diagonal (i.e., \(w_{ii}=0\)); and W is the sum of all \(w_{ij}\)^{36}. In line with common approaches, we assign a weight of 1 for neighbouring zones and 0 otherwise^{37}. A neighbourhood is defined such that every country has at least one neighbour in the spatial weights matrix.
Applying this test to the geographical distribution of median posteriors of ID produces an I value of 0.85 (\(p<0.001\)) using the spdep package in R^{31,38}, indicating significant positive spatial autocorrelation.
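The statistic itself is straightforward to compute from a binary adjacency matrix. The sketch below is a direct transcription of the standard Moran’s I formula on hypothetical toy data; it does not reproduce the spdep workflow or the significance test used in the analysis.

```python
def morans_i(x, w):
    """Moran's I for values x and a weights matrix w with zero diagonal."""
    n = len(x)
    x_bar = sum(x) / n
    dev = [v - x_bar for v in x]
    total_weight = sum(sum(row) for row in w)  # W in the formula
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Four hypothetical units arranged on a line: 0 - 1 - 2 - 3.
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1.0, 1.0, 5.0, 5.0], w)    # similar values are neighbours
alternating = morans_i([1.0, 5.0, 1.0, 5.0], w)  # neighbours always differ
print(round(clustered, 3), round(alternating, 3))
# prints "0.333 -1.0"
```

Values close to 1 indicate that neighbouring units carry similar values, while values close to −1 indicate that neighbours systematically differ.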
This result is a compelling finding, as the input dataset does not include any information about the countries’ geographical locations. Neighbouring countries may share the complex dynamics of COVID-19 as the pandemic spreads worldwide, resulting in positive spatial autocorrelation in the distribution of median posterior IDs of the countries included in the analysis. Previous work^{39} suggests that geographically close countries are likely to share spatio-temporal dynamics due to human spatial dynamics and similar demographic factors across geographic regions. A review of the available literature^{40} highlights that a country’s interconnectedness influences the spreading dynamics of COVID-19. This literature suggests that geographical closeness and interconnectivity have substantial implications for the spreading dynamics of COVID-19, offering a potential explanation for the spatial autocorrelation identified in the complexity of spreading dynamics observed in the analysis.
High-income countries are characterised by lower complexity data
Our analysis reveals that countries in higher income groups are more likely to lie on low-dimensional manifolds. Figure 3 presents the distribution of income levels across the two ID manifolds. The World Bank assigns one of four income levels to each country, ranging from low to high income^{41}. For the 2022 fiscal year, low-income countries are those with a Gross National Income (GNI) per capita of $1,045 (USD) or less in 2020; lower-middle-income between $1,046 and $4,095; upper-middle-income between $4,096 and $12,695; and high-income at $12,696 or more. GNI per capita represents the value produced by each person in a country’s economy in a given year, regardless of whether the source of the value created is domestic production or receipts from overseas. While the GNI per capita does not entirely summarise a country’s level of development or welfare, it has proved to be a useful and readily available indicator that closely correlates with other, non-monetary measures of the quality of life, such as life expectancy at birth, mortality rates of children, and enrollment rates in school^{42}.
A range of factors may explain the skewed distribution of income levels towards the low-ID manifold. High-income countries usually have aging populations, arising from declining fertility and improving mortality due to income growth, changes in health behaviours, and higher education levels^{43,44}. Aging populations in many high-income countries have played a role in creating a greater mortality burden during the COVID-19 pandemic over 2020 to 2021 due to increased vulnerability to serious infections in population groups aged over 65^{45}. Additionally, underlying conditions such as diabetes and cardiovascular disease significantly contribute to increased severity risk from COVID-19^{46}. Importantly, chronic medical conditions are widely prevalent in aging populations in high-income countries^{47}. These factors have significantly impacted the mortality burden per capita over the COVID-19 pandemic. Research from the World Bank^{45,48} estimates that high-income countries have had 2 to 3 times the COVID-19 mortality burden per capita compared to other countries over 2020. The age distribution disparity across the two ID manifolds is evident in Fig. 4A,B.
Figure 4A reveals that countries assigned to a high-ID manifold have less than 7% of the population aged over 65 on average. In comparison, countries assigned to a low-ID manifold have 13% of the population aged over 65 on average, despite displaying bimodality due to some low-income countries in the low-ID manifold. A Kolmogorov-Smirnov test of the null hypothesis that the two distributions are sampled from the same population rejects it at the \(p < 0.001\) significance level^{49}. The mean age distribution of countries in each ID manifold presented in Fig. 4B corroborates that countries assigned to a low-ID manifold host a higher proportion of the population aged over 65, while countries assigned to a high-ID manifold host a higher proportion of the population aged between 0 and 14.
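The two-sample Kolmogorov-Smirnov statistic underlying such a test is the largest vertical gap between the two empirical distribution functions. The sketch below computes only this statistic, not the p-value, and the percentages are hypothetical toy values rather than the study data.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    values = sorted(set(a) | set(b))

    def ecdf(sample, v):
        # Proportion of the sample that is less than or equal to v.
        return sum(1 for s in sample if s <= v) / len(sample)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)


# Hypothetical shares (%) of population aged over 65 in two groups.
high_id_group = [3.0, 4.0, 5.0, 6.0, 7.0]
low_id_group = [12.0, 13.0, 15.0, 18.0, 20.0]
separated = ks_statistic(high_id_group, low_id_group)

# Partially overlapping samples give a smaller statistic.
overlapping = ks_statistic([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
print(separated, overlapping)
# prints "1.0 0.5"
```

A statistic of 1 means the two samples do not overlap at all; the null hypothesis of a common distribution is rejected when the statistic is large relative to the sample sizes.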
A possible explanation for high-income countries being assigned to a low-ID manifold arises after identifying a link between high-income countries, aging populations, and increased per capita mortality burden. Namely, since new deaths pmp are a subset of new cases pmp, increased new deaths pmp in high-income countries may provide explanatory power to new cases pmp, resulting in greater dependency between the two time-series datasets and thus requiring a lower ID. Conversely, lower rates of reported new deaths pmp in low-income countries would lower the dependence in the entire dataset included in the analysis, requiring a higher number of dimensions to represent the data accurately.
Furthermore, issues in the quality of COVID-19 data in low- and middle-income countries are widely researched and may be another factor contributing to the assignment of countries to ID manifolds. Previous work^{45,50} identifies that under-reporting of deaths varies globally but is highest in low-income and fragile settings. Such data artifacts could lead to a higher number of unexplained values, thereby lowering the dependence in the dataset and requiring more dimensions to describe the data effectively.
Changes in ID over stages of the COVID-19 pandemic
Stratifying the datasets has provided a granular view of the ID over the course of the pandemic, and a summary of the results for each stage is presented in Figs. 5, 6, 7 and 8.
We can observe that countries lie on two to three ID manifolds throughout the pandemic. From March to June 2020 (Stage 1), the manifolds have similar IDs, which could reflect a generally united global response to the pandemic (Fig. 5, \(d_1 = 9, d_2 = 8.6\)). From June to October 2020, however, we find that the data lie on three different manifolds (Fig. 6, \(d_1 = 10, d_2 = 9.2, d_3 = 9.75\)). These three ID manifolds continue from October 2020 to February 2021, with all three manifolds having an ID between 9 and 10 (Fig. 7, \(d_1 = 10, d_2 = 9.2, d_3 = 9.75\)). Countries belonging to the manifold with the ID of 5.9 are mostly European (e.g., France, Italy). These countries experience a rise in the growth rate of cases and deaths, which precedes a corresponding rise in countries lying on the manifold with an ID of 6.9 (e.g., US, Spain, UK, Russia). Meanwhile, other countries with an ID of 9.2 continue to experience average growth rates in cases and deaths (e.g., Australia, China, India, and much of South America). Finally, from February to May 2021, some countries lie on one clear manifold, with an ID of 7.5 (Fig. 8, \(d_1 = 7.5, d_2 = 6.4\)).
Implications and future research
We have identified heterogeneous ID manifolds for a dataset incorporating publicly available COVID-19 data, namely government stringency levels and case and death rates per capita, utilising Hidalgo, a Bayesian mixture model. Applying this model to the dataset reveals low intrinsic dimensionality, highlighting a potential for significant dimensionality reduction in the dataset. These findings suggest that few independent dimensions are required to effectively describe the dataset, enabling practitioners to better discern the level of model complexity required when describing or forecasting such data.
Furthermore, we demonstrate how heterogeneous ID estimators like Hidalgo may be employed to partition and simplify high-dimensional datasets. We reveal interesting spatial and demographic patterns in data that capture the unfolding of the global pandemic. It may be valuable for practitioners to consider these tools as part of their arsenal, to quantify data complexity and heterogeneity meaningfully, as part of a quest to effectively extract useful information contained in high-dimensional data.
Ultimately, the results of this analysis are subject to the quality of data available. While every effort has been made to correct for issues in the data, the inherent discrepancy in data quality across countries inevitably affects the results of this analysis. As previously ascertained^{45,51}, it is currently infeasible to account for all under-reporting and data quality issues for specific countries, which therefore remain an artifact in this dataset. It is also important to note that inherent assumptions in the Hidalgo algorithm require a careful choice of datasets in addition to some scaling transformations to limit temporal autocorrelation. These requirements limit the immediate applicability of Hidalgo to time series datasets, as analysis must be conducted on standardised first-order differences with continuous values (e.g. new cases pmp, new deaths pmp), precluding practitioners from considering more intuitive datasets like cumulative cases or deaths.
In our analysis, we emphasized the interpretation of the spatial characteristics of the results. Nonetheless, we acknowledge that further studies can be conducted to deepen the understanding of our findings from a temporal point of view. One option is to temporally align the data by considering the first date a COVID19 case was reported. This temporal restructuring of the data would provide valuable insights into the temporal evolution of the pandemic and its impact on different regions over time. By examining the dynamics of the outbreak from a dynamic system perspective, we can gain a deeper understanding of how the pandemic unfolded and its varying effects across regions. We are mindful of the importance of investigating these temporal features, and we plan to pursue this line of research in the future.
Moreover, it would be valuable to conduct further examination on other factors contributing to the complexity (ID) of the COVID19 data dynamics of a country to better understand drivers for complexity in pandemics. While we have identified that income level, age distribution, disease burden, and data quality all play a role in determining the ID of a country, developing a more nuanced understanding of these contributing factors would provide utility to the broader scientific community. For example, this could encompass additional significant socioeconomic and environmental covariates^{52,53}.
Finally, from a methodological perspective, we recall that Hidalgo is based on ratios of distances between a point and its first and second nearest neighbours (NNs). In principle, one could rely on ratios of distances from NNs of generic order, as a suitable estimator for the homogeneous ID has recently been proposed^{54}. Future work is needed to extend this methodology to a mixture framework that accounts for the presence of heterogeneous IDs. Although considering larger neighbourhoods reduces the variance of the estimator, we remark that considering more generic ratios would imply more stringent assumptions. These assumptions, such as a broader local homogeneity, may be violated when working with real-world data.
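The role of the first and second NN distances can be illustrated with the homogeneous TWO-NN estimator^{17}, on which Hidalgo builds. Under local homogeneity, the ratio \(\mu_i = r_{i2}/r_{i1}\) of the second to the first NN distance follows a Pareto distribution with density \(f(\mu \mid d) = d\,\mu^{-(d+1)}\) for \(\mu \ge 1\), which gives the maximum likelihood estimate \(\hat{d} = N / \sum_i \log \mu_i\). The Python sketch below implements only this single-manifold estimate, not the Bayesian mixture used by Hidalgo; the function name and the synthetic data are assumptions for illustration, and the code assumes there are no duplicate points.

```python
import numpy as np

def two_nn_id(points):
    """Maximum likelihood TWO-NN estimate of a single (homogeneous) ID.

    Illustrative sketch; assumes all points are distinct so that r1 > 0.
    """
    points = np.asarray(points, dtype=float)
    # Dense pairwise Euclidean distances (fine for a few hundred points).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude each point from its own neighbour list.
    np.fill_diagonal(dist, np.inf)
    # Two smallest distances per row: first and second nearest neighbours.
    nearest = np.sort(np.partition(dist, 1, axis=1)[:, :2], axis=1)
    r1, r2 = nearest[:, 0], nearest[:, 1]
    # Ratio mu = r2 / r1 is Pareto(1, d) under local homogeneity.
    mu = r2 / r1
    return len(mu) / np.log(mu).sum()

# Synthetic check: a flat 2-dimensional sheet embedded isometrically in 5 dimensions.
rng = np.random.default_rng(0)
sheet = rng.uniform(size=(500, 2))
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # 5 x 2, orthonormal columns
embedded = sheet @ basis.T
estimate = two_nn_id(embedded)  # expected to be close to 2, not to 5
```

Because only the two closest neighbours enter the ratio, the homogeneity assumption needs to hold only on a very small scale around each point, which is the trade-off against variance discussed above.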
Conclusions
This work evaluated the complexity of a dataset consisting of the standardised per-capita growth rates of COVID-19 cases and deaths, and an index describing the stringency of a country’s NPI measures (CSI), using a heterogeneous intrinsic dimension estimator implemented as a Bayesian mixture model (Hidalgo). We identify that the COVID-19 dataset may be projected onto two low-dimensional manifolds (\(d_{1} = 12\), \(d_{2} = 9\)). Lower dimensionality suggests stronger dependence among the standardised growth rates of cases and deaths per capita and the CSI for a country over the given period. Notably, it indicates that COVID-19 data dynamics are governed by a small set of parameters, which has important implications for practitioners seeking to model these dynamics or apply dimensionality reduction techniques to these data.
This work has demonstrated how the intrinsic dimension can help extract novel insights across multiple complex datasets and identify effective ways to segment data. For example, we identify spatial autocorrelation in the distribution of ID estimates for countries. Furthermore, we highlight a skewed distribution of high-income countries projected onto a low-dimensional manifold, likely due to the increased per capita mortality burden from COVID-19 arising from aging populations and the increased prevalence of comorbidities. While we make significant progress towards understanding the drivers of complexity in the included COVID-19 datasets, a more nuanced understanding of these contributing factors would enable decision-makers to better account for complexity in pandemics, and is identified as an area of future research.
Data availability
The datasets used in this paper are publicly accessible and are sourced from the Our World in Data website (ourworldindata.org). No access requests or ethics approvals were required to retrieve the data used in this paper. The aggregated dataset used in this analysis is available at https://github.com/EdgarSantosFernandez/covid19_ID.
Code accessibility
R code to reproduce the outcomes and plots presented in this paper can be found at https://github.com/EdgarSantosFernandez/covid19_ID.
References
Levina, E. & Bickel, P. J. Maximum likelihood estimation of intrinsic dimension. In Advances in neural information processing systems, 777–784 (2005).
Camastra, F. & Staiano, A. Intrinsic dimension estimation: Advances and open problems. Inf. Sci. 328, 26–41. https://doi.org/10.1016/j.ins.2015.08.029 (2016).
Shoghri, A. E., Liebig, J., Jurdak, R., Gardner, L. & Kanhere, S. S. Identifying highly influential travellers for spreading disease on a public transport system. IEEE Comput. Soc. https://doi.org/10.1109/WoWMoM49955.2020.00020 (2020).
Levin, R., Chao, D. L., Wenger, E. A. & Proctor, J. L. Cell phone mobility data and manifold learning: Insights into population behavior during the COVID-19 pandemic. Preprint, Epidemiology (2020). https://doi.org/10.1101/2020.10.31.20223776.
Wisesty, U. N. & Mengko, T. R. Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome. Bull. Electr. Eng. Inform. 10(4), 2170–2180. https://doi.org/10.11591/eei.v10i4.2803 (2021).
Hearn, A. The HigherEd Coronavirus Response in the Public Sector: The Usage of Dimensionality Reduction Techniques and Feature Importance Algorithms to Analyze Fall ReOpening Plans. online manuscript (2020).
Pierri, F. et al. Online misinformation is linked to early COVID-19 vaccination hesitancy and refusal. Sci. Rep. 12, 5966 (2022).
Ordun, C., Purushotham, S. & Raff, E. Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP, and DiGraphs. arXiv:2005.03082 [cs] (2020).
Doanvo, A. et al. Machine Learning Maps Research Needs in COVID-19 Literature. Patterns 1, 100123. https://doi.org/10.1016/j.patter.2020.100123 (2020).
Sivakumar, B. & Deepthi, B. Complexity of COVID-19 Dynamics. Entropy 24, 50. https://doi.org/10.3390/e24010050 (2021).
Santos-Fernandez, E., Denti, F., Mengersen, K. & Mira, A. The role of intrinsic dimension in high-resolution player tracking data: Insights in basketball. Ann. Appl. Stat. 16(1), 326–348 (2022).
Eneva, E., Kumaraswami, K. & Matteucci, M. Wekkem: A study in fractal dimension and dimensionality reduction. In Workshop on Fractals and Self-similarity in Data Mining: Issues and Approaches (2002).
Bishop, C. M. Neural Networks for Pattern Recognition (Clarendon Press, 1995).
Allegra, M., Facco, E., Denti, F., Laio, A. & Mira, A. Data segmentation based on the local intrinsic dimension. Sci. Rep. 1–27. https://doi.org/10.1038/s41598020722220 (2019).
Denti, F. intRinsic: An R package for model-based estimation of the intrinsic dimension of a dataset. J. Stat. Softw. 106, 1–45. https://doi.org/10.18637/jss.v106.i09 (2023) arXiv:2102.11425.
MacKay, D. & Ghahramani, Z. Comments on ‘Maximum Likelihood Estimation of Intrinsic Dimension’ by E. Levina and P. Bickel (2004). Comment on personal webpage (2005).
Facco, E., d’Errico, M., Rodriguez, A. & Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Sci. Rep. 7, 12140 (2017).
Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. Bayesian Data Analysis (Chapman and Hall/CRC, 1995).
Rousseau, J. & Mengersen, K. Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. R. Stat. Soc. Ser. B Stat. Methodol. 73, 689–710. https://doi.org/10.1111/j.14679868.2011.00781.x (2011).
Malsiner-Walli, G., Frühwirth-Schnatter, S. & Grün, B. Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26, 303–324. https://doi.org/10.1007/s1122201495002 (2016).
Ritchie, H. et al. Coronavirus Pandemic (COVID-19). Our World in Data (2020).
Hale, T. et al. A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nat. Hum. Behav. https://doi.org/10.1038/s41562021010798 (2021).
Edejer, T. T.-T. et al. Projected health-care resource needs for an effective response to COVID-19 in 73 low-income and middle-income countries: a modelling study. Lancet Glob. Health 8, e1372–e1379. https://doi.org/10.1016/S2214109X(20)303831 (2020).
Islam, N. et al. Physical distancing interventions and incidence of coronavirus disease 2019: natural experiment in 149 countries. BMJ 370, m2743. https://doi.org/10.1136/bmj.m2743 (2020).
Hale, T. et al. Government responses and COVID-19 deaths: Global evidence across multiple pandemic waves. PLOS ONE 16, e0253116. https://doi.org/10.1371/journal.pone.0253116 (2021).
Hale, T. et al. Pandemic Governance Requires Understanding Socioeconomic Variation in Government and Citizen Responses to COVID-19. SSRN Scholarly Paper ID 3641927, Social Science Research Network, Rochester, NY (2020).
Frey, C. & Chen, C. Democracy, Culture, and Contagion: Political Regimes and Countries Responsiveness to Covid-19 (2020).
Pulejo, M. & Querubín, P. Electoral Concerns Reduce Restrictive Measures During the COVID-19 Pandemic. Working Paper 27498, National Bureau of Economic Research (2020). https://doi.org/10.3386/w27498.
Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 20, 533–534. https://doi.org/10.1016/S14733099(20)301201 (2020).
Moritz, S. & Bartz-Beielstein, T. imputeTS: Time series missing value imputation in R. R J. 9, 207–218 (2017).
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2021).
Wickham, H. et al. Welcome to the tidyverse. J. Open Source Softw. 4, 1686. https://doi.org/10.21105/joss.01686 (2019).
Meilă, M. Comparing clusterings: an information based distance. J. Multivar. Anal. 98, 873–895. https://doi.org/10.1016/j.jmva.2006.11.013 (2007).
Wade, S. & Ghahramani, Z. Bayesian cluster analysis: point estimation and credible balls (with discussion). Bayesian Anal. 13, 559–626. https://doi.org/10.1214/17BA1073 (2018) arXiv:1505.03339.
Pope, P., Zhu, C., Abdelkader, A., Goldblum, M. & Goldstein, T. The Intrinsic Dimension of Images and Its Impact on Learning. ICLR 2021 Conference (2021).
Moran, P. A. P. Notes on continuous stochastic phenomena. Biometrika 37, 17–23. https://doi.org/10.2307/2332142 (1950).
Jackson, M. C., Huang, L., Xie, Q. & Tiwari, R. C. A modified version of Moran’s I. Int. J. Health Geogr. 9, 33. https://doi.org/10.1186/1476072X933 (2010).
Bivand, R. S., Pebesma, E. & Gomez-Rubio, V. Applied Spatial Data Analysis with R 2nd edn. (Springer, 2013).
McLafferty, S. Placing pandemics: Geographical dimensions of vulnerability and spread. Eurasian Geogr. Econ. 51, 143–161. https://doi.org/10.2747/15397216.51.2.143 (2010).
McMahon, T., Chan, A., Havlin, S. & Gallos, L. K. Spatial correlations in geographical spreading of COVID-19 in the United States. Sci. Rep. 12, 699. https://doi.org/10.1038/s41598021046532 (2022).
Hamadeh, N., van Rompaey, C. & Metreau, E. New World Bank country classifications by income level: 2021-2022 (2021).
International Economics Department. Per Capita Income: Estimating Internationally Comparable Numbers (1989).
Ho, J. Y. & Hendi, A. S. Recent trends in life expectancy across high income countries: Retrospective observational study. BMJ 362, k2562. https://doi.org/10.1136/bmj.k2562 (2018).
Lee, J. & Smith, J. P. Health, Economic Status, and Aging in High-Income Countries. In Future Directions for the Demography of Aging: Proceedings of a Workshop (National Academies Press (US), 2018).
Schellekens, P. & Sourrouille, D. COVID-19 Mortality in Rich and Poor Countries: A Tale of Two Pandemics? (World Bank, 2020).
Booth, A. et al. Population risk factors for severe disease and mortality in COVID-19: A global systematic review and meta-analysis. PLOS ONE 16, e0247461. https://doi.org/10.1371/journal.pone.0247461 (2021).
Ofori-Asenso, R. et al. Recent patterns of multimorbidity among older adults in high-income countries. Popul. Health Manag. 22, 127–137. https://doi.org/10.1089/pop.2018.0069 (2019).
Bayati, M. Why is COVID-19 more concentrated in countries with high economic status? Iran. J. Public Health 50, 1926–1929. https://doi.org/10.18502/ijph.v50i9.7081 (2021).
Kolmogorov, A. N. & Smirnov, V. I. Kolmogorov-Smirnov Test. In The Concise Encyclopedia of Statistics, 283–287, https://doi.org/10.1007/9780387328331_214 (Springer, New York, NY, 2008).
Whittaker, C. et al. Under-reporting of deaths limits our understanding of true burden of covid-19. BMJ 375, n2239. https://doi.org/10.1136/bmj.n2239 (2021).
Lloyd-Sherlock, P., Sempe, L., McKee, M. & Guntupalli, A. Problems of data availability and quality for COVID-19 and older people in low and middle-income countries. Gerontol. https://doi.org/10.1093/geront/gnaa153 (2020).
Weaver, A. K., Head, J. R., Gould, C. F., Carlton, E. J. & Remais, J. V. Environmental factors influencing covid-19 incidence and severity. Annu. Rev. Public Health 43, 271–291 (2022).
Kong, J. D., Tekwa, E. W. & Gignoux-Wolfsohn, S. A. Social, economic, and environmental factors influencing the basic reproduction number of covid-19 across countries. PLoS ONE 16, e0252373 (2021).
Denti, F., Doimo, D., Laio, A. & Mira, A. The generalized ratios intrinsic dimension estimator. Sci. Rep. https://doi.org/10.1038/s41598022209911 (2022).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
Acknowledgements
AV received funding from the Australian Research Council (ARC) Centre of Excellence for Mathematical and Statistical Frontiers for Big Data, Big Models and New Insights (ACEMS) under Grant Number CE140100049 and the First Byte Grant through the Centre for Data Science at the Queensland University of Technology. KM was supported by an ARC Laureate Fellowship under Grant Number FL150100150. AM was supported by FISR 2020 COVID No. FISR2020IP_03843 and by the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 101016233. The funders had no role in study design, data collection and analysis, decision to publish, or manuscript preparation. Data visualizations, including the maps, were generated using the R package ggplot2 (3.4.1)^{55}.
Author information
Contributions
A.V. contributed to data aggregation, application of the methodology, securing funding, analysing the results, and writing the paper and code. E.S.F. contributed to writing the paper, securing funding, and providing important feedback and guidance. F.D. contributed to writing the paper, developing the methodology, and providing important feedback. A.M. and K.M. contributed to the ideation and methodology development, provided supervisory guidance throughout the paper, secured and provided funding, and provided important feedback in the editing process.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Varghese, A., SantosFernandez, E., Denti, F. et al. A global perspective on the intrinsic dimensionality of COVID19 data. Sci Rep 13, 9761 (2023). https://doi.org/10.1038/s41598023361161
DOI: https://doi.org/10.1038/s41598023361161