Scholars interested in cultural diversity have long suggested that similarities and differences across human populations might be understood, at least in part, as stemming from differences in the social and physical ecologies individuals inhabit. Here, we describe the EcoCultural Dataset (ECD), the most comprehensive compilation to date of country-level ecological and cultural variables around the globe. ECD covers 220 countries, 9 ecological variables operationalized by 11 statistical metrics (including measures of variability and predictability), and 72 cultural variables (including values, personality traits, fundamental social motives, subjective well-being, tightness-looseness, indices of corruption, social capital, and gender inequality). This rich dataset can be used to identify novel relationships between ecological and cultural variables, to assess the overall relationship between ecology and culture, to explore the consequences of interactions between different ecological variables, and to construct new indices of cultural distance.
culture • ecology
Background & Summary
Cataloguing and explaining human cultural diversity has been a core question for many branches of the social sciences. Many evolutionary social scientists and behavioral ecologists posit that the patterns of cultural diversity observed around the globe are due, at least in part, to the physical and social ecologies individuals inhabit1,2,3. For example, in societies with a high pathogen prevalence, cultures tend to be more collectivistic—this may be a strategy for dealing with high rates of disease through behaviors that protect the individual and the group from the threat of outside pathogens4,5; higher levels of population density have been linked to lower rates of fertility6,7, which may reflect an adaptive shift toward slower life history strategies in the face of stiffer social competition; and societies that are afflicted with high extrinsic mortality threats tend to have stronger social norms, which may increase the likelihood of survival in such conditions8.
Although investigations of the interplay between ecology and culture have been fruitful, most have focused on links between a single ecological variable (e.g., pathogen prevalence, population density, resource levels) and a single cultural outcome or small set of such outcomes (e.g., individualism-collectivism, tightness-looseness, the Big Five personality traits). Additionally, with some exceptions7,9,10, such investigations have focused on levels of those ecological variables at a single time point.
Here, we present the EcoCultural Dataset (ECD)11, which aims to address these limitations and spur future discoveries. The ECD is a compilation of data from 220 countries on nine ecological variables and 72 cultural variables that are likely to be of broad interest to social scientists. The country-level data in the ECD complements other comprehensive data sources such as D-PLACE, which focuses on small-scale societies12. The ECD includes time series data on ecological variables and 11 statistical metrics for each, designed to index properties such as historical averages, variability across time, predictability, and extreme perturbations.
Data were collected on the level of countries. In constructing the ECD11, we selected only variables that contained data from a minimum of twenty countries.
Although conceptualizations of ecology vary across disciplines13, they generally share an emphasis on the relationships between organisms and their external environment14. Our conceptualization of ecology is largely grounded in behavioral ecology15 and extends beyond the physical environment to include key features of the social environment that have been linked to adaptive responses across species, such as population density and resource availability2. This conceptualization of ecological variables encompasses environmental conditions that have direct implications for an individual’s survival and reproduction, which include not only aspects of the natural environment related to climate, but also factors like the availability and distribution of resources key to biological fitness.
We required that all ecological variables contain at least 20 time points for some number of countries to conduct a sufficiently powered time series analysis using the ARIMA model. We gathered data on nine different ecological variables: rainfall, temperature, GDP per capita, income inequality, external mortality, life expectancy, disease threat, population density, and unemployment rates (Table 1). Although these variables capture a wide range of features of the physical and social environment that may have consequences for human cultural variation, we do not argue that this is an exhaustive set of variables which could be considered ecological.
Cultural variables were identified through reviews of the psychological literature and by crowdsourcing on social media and Listservs from the psychology community. To be included in the ECD, cultural variables needed (a) to be indices (not single-item measures) and (b) to contain data for at least 20 countries. In total, this search yielded 72 cultural variables including values, personality traits, motivations, social norms, subjective well-being, innovation, and government functioning (see data for full list).
Data usage and permissions
Ecological variables were collected from the World Bank, World Health Organization, Institute for Health Metrics and Evaluation, and scholarly publications. These sources all permit the use of their data under the CC Attribution 4.0 International License. Full sources are listed in Table 1.
Cultural variables were collected from academic publications, publications by NGOs and intergovernmental organizations, or other public-facing data published online, all of which are permitted for use with attribution. The sources of all cultural variables used in the ECD can be found within the datafiles on OSF (https://osf.io/r9msf/).
In addition to raw ecological and cultural data, the ECD11 includes 11 statistical metrics for each ecological variable (on the level of countries) to allow researchers to explore the relationship between time-variant features of ecological conditions and present-day cultural variation: current levels, mean across time, extreme perturbations (maximum, minimum), as well as indicators of trends (standardized linear regression coefficient), variability across time (standard deviation, range, percentage of outliers), and predictability (Mean Absolute Percentage Error, Mean Absolute Scaled Error, first-order autocorrelation). We calculate these metrics for years ranging from 1950 to 2020.
Current Level: Datapoint for the year of publication for the corresponding cultural variable, or the most recent available data point.
Mean Level: Arithmetic average of all available datapoints.
Standard Deviation (Variability): Typical deviation of datapoints from the mean, calculated as the square root of the average squared deviation.
Range (Variability): Difference between the maximum and minimum values.
Maximum Value (Extreme Perturbation): Highest datapoint in the dataset.
Minimum Value (Extreme Perturbation): Lowest datapoint in the dataset.
Mean Absolute Percentage Error (MAPE; Predictability): Derived from auto.arima and a train-test procedure (see below), an accuracy measure defined as the average absolute percent deviation between the actual values in a time series and the values forecasted by an ARIMA model. Higher percentages indicate more error, or less predictability.
Mean Absolute Scaled Error (MASE; Predictability): Derived from auto.arima and a train-test procedure (see below), an accuracy measure defined as the ratio of the errors made by an ARIMA model to those made by a naïve forecast. Higher values indicate more error, or less predictability.
First-Order Autocorrelation (Predictability): Correlation between successive residuals in a time series, with greater values indicating a high degree of relatedness between a time point, t_n, and the successive time point, t_{n+1}. Lower values indicate less similarity between time points, or less predictability.
Percent of Outliers (Extreme Perturbation): Percent of datapoints in a time series that deviate more than 2.5 standard deviations from the mean.
Standardized Beta Coefficient (β): Measure of the linear relationship between time and the ecological variable, derived from a linear regression analysis where the data have been standardized such that the standard deviations of both variables equal 1 and their means equal 0. Values closer to 0 indicate a more gradual linear increase (or decrease, for negative values) over time.
Thus, we calculated 99 different estimates of ecology for each country for which data was available—eleven metrics for each of the nine ecological variables (current levels of rainfall in Ukraine, mean rainfall in Ukraine, etc.).
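As an illustration only, the descriptive metrics listed above could be computed for a single country's series along the following lines. This is a minimal sketch in Python rather than the R code used to build the ECD, and it is not the authors' implementation. The function name `describe_series`, the use of the population standard deviation, and the calculation of the autocorrelation on the mean-centered series are all assumptions made for the example. MAPE and MASE are omitted because they require a forecasting model.

```python
from statistics import mean, pstdev


def describe_series(values, outlier_sd=2.5):
    """Summarize one country's yearly series (oldest value first)."""
    n = len(values)
    avg = mean(values)
    sd = pstdev(values)  # assumption: population SD; the sample SD is equally plausible
    ss = sum((v - avg) ** 2 for v in values)  # sum of squared deviations
    if n < 3 or ss == 0:
        raise ValueError("need at least three datapoints that are not all equal")

    # Number of datapoints lying more than `outlier_sd` SDs from the mean.
    n_outliers = sum(1 for v in values if abs(v - avg) > outlier_sd * sd)

    # Lag-1 autocorrelation of the mean-centered series (assumption: the
    # ECD description refers to residuals, which may be defined differently).
    lag1 = sum((values[i] - avg) * (values[i + 1] - avg) for i in range(n - 1)) / ss

    # With a single predictor, the standardized slope equals the Pearson
    # correlation between the time index and the series.
    t_avg = (n - 1) / 2
    t_ss = sum((t - t_avg) ** 2 for t in range(n))
    cov = sum((t - t_avg) * (v - avg) for t, v in enumerate(values))
    beta = cov / (t_ss * ss) ** 0.5

    return {
        "current": values[-1],  # most recent datapoint
        "mean": avg,
        "sd": sd,
        "range": max(values) - min(values),
        "max": max(values),
        "min": min(values),
        "pct_outliers": 100 * n_outliers / n,
        "lag1_autocorrelation": lag1,
        "standardized_beta": beta,
    }


# Made-up toy series, not ECD data.
toy_series = [1.0, 2.0, 3.0, 4.0, 5.0]
metrics = describe_series(toy_series)
print(metrics)
```

Applying such a function to each of the nine ecological variables, and adding the two forecast-based metrics, would yield the 99 country-level estimates described above.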
MAPE and MASE values were calculated using the auto.arima function from the forecast package in R16, a machine learning algorithm that fits models with various AutoRegressive Integrated Moving Average (ARIMA) parameters to a time series dataset and selects the optimal model based on fit. We used a two-step train-test procedure. In the first step, auto.arima was used to fit a model based on 80% of datapoints. In the second step, that model was used to forecast the held-out 20% of datapoints, and its forecasts were compared against the actual values. We gathered three measures of predictability from these analyses: MAPE and MASE (where predictability for a given ecological variable was operationalized as the amount of error), and first-order autocorrelation.
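The train-test logic can be sketched as follows. This is a simplified Python illustration, not the procedure used for the ECD: instead of automatic model selection, it uses one fixed specification, a random walk with drift (equivalent to ARIMA(0,1,0) with a constant), so that the example needs no external packages. The function name `train_test_errors`, the choice of the earliest 80% of datapoints as the training set, and the scaling of MASE by the in-sample error of a one-step naïve forecast are assumptions.

```python
from statistics import mean


def train_test_errors(series, train_frac=0.8):
    """Return (MAPE, MASE) for forecasts of the held-out end of a series."""
    split = int(len(series) * train_frac)
    train, test = series[:split], series[split:]
    if len(train) < 2 or not test:
        raise ValueError("series too short for a train-test split")

    # Stand-in model: random walk with drift, where the drift is the
    # average one-step change observed in the training data.
    drift = (train[-1] - train[0]) / (len(train) - 1)
    forecasts = [train[-1] + drift * h for h in range(1, len(test) + 1)]

    abs_errors = [abs(actual - f) for actual, f in zip(test, forecasts)]

    # MAPE: average absolute error as a percentage of the actual value
    # (undefined when an actual value is zero).
    mape = 100 * mean(e / abs(actual) for e, actual in zip(abs_errors, test))

    # MASE: forecast error scaled by the in-sample error of a naive
    # forecast that repeats the previous value.
    naive_mae = mean(abs(train[i] - train[i - 1]) for i in range(1, len(train)))
    mase = mean(abs_errors) / naive_mae

    return mape, mase


# Made-up toy series, not ECD data.
toy_series = [10, 12, 11, 13, 14, 13, 15, 16, 18, 20]
mape, mase = train_test_errors(toy_series)
print(f"MAPE = {mape:.2f}%, MASE = {mase:.2f}")
```

A MASE above 1 means the model's out-of-sample forecasts were less accurate than the naïve forecast was in-sample, which in the ECD's terms corresponds to lower predictability.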
ECD Codebook: Contains the meta-data for all ecological and cultural variables contained within the dataset, including source, whether the variable is a part of a larger taxonomy (e.g., Big Five, Moral Foundations), sample size when available, and variable type (e.g., continuous/discrete, raw/transformed).
ECD Data: Contains raw country-level data for nine ecological variables, 72 cultural variables, operationalizations (calculated based on all the time series data including and preceding that year), and geographic meta-data (latitude, longitude, World Health Organization world region). Because this file presents time series data, it is organized in the “long-format,” such that every row represents a country and year, with columns representing specific ecological variables. If calculating correlations between an ecological variable and a cultural variable using time series data, it is important to truncate the time series of the ecological variable to before the data collection of the cultural variable. For example, if correlating mean rainfall and Agreeableness, calculate a country’s mean rainfall from the first available date to 2007, which is the year of publication for the Agreeableness variable, not 2019, the last available year of data for rainfall.
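The truncation step might look like the sketch below, written in Python for illustration. The column names (`country`, `year`, `rainfall`), the helper `mean_up_to`, and all values are hypothetical and should be replaced with the names used in the actual data file. The sketch assumes long-format rows and the "NA" coding for missing data.

```python
from statistics import mean

# Made-up long-format rows: one row per country and year.
rows = [
    {"country": "Country A", "year": 2005, "rainfall": 560.0},
    {"country": "Country A", "year": 2006, "rainfall": 540.0},
    {"country": "Country A", "year": 2007, "rainfall": "NA"},
    {"country": "Country A", "year": 2008, "rainfall": 700.0},
    {"country": "Country B", "year": 2006, "rainfall": 300.0},
    {"country": "Country B", "year": 2007, "rainfall": 320.0},
]


def mean_up_to(rows, country, variable, cutoff_year):
    """Mean of `variable` for `country` using years up to and including `cutoff_year`."""
    values = [
        row[variable]
        for row in rows
        if row["country"] == country
        and row["year"] <= cutoff_year
        and row[variable] != "NA"  # skip missing entries
    ]
    return mean(values) if values else None


# Truncate at 2007, the publication year of the cultural variable,
# rather than using every available year.
print(mean_up_to(rows, "Country A", "rainfall", 2007))
```

Here the 2008 value is excluded because it postdates the cultural measurement, and the missing 2007 entry is skipped rather than treated as zero.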
The specific measures used to calculate the variables are available in their original source material. The scales for the ecological variables are available in Table 1. Missing data are entered as “NA”. In the case of the “Codebook”, NA under “Sample Size” indicates that there was no sampling (in the case of variables like “rainfall” or “Human Development Index”) or that the exact sample size is not provided (in the case of the Hofstede variables).
It is worth noting that spatial and temporal autocorrelation are issues that researchers may encounter when using this dataset. Countries in geographical proximity should not be considered as independent datapoints, due to high probability of shared ancestral history and horizontal cultural transmission17,18. Further, ecological data from two consecutive years are likely highly correlated given that it is rare for ecological conditions to drastically change from year to year and our ecological data is often averaged or collected at a single time point within a year. Spatial autocorrelation can be addressed in many ways, including by conducting analyses within world regions (to control for shared cultural history) or by using statistical approaches such as autocovariate models19. Temporal autocorrelation can be addressed through various methods for detrending time series data such as differencing or residualizing out linear trends and autoregressive components10,20.
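Two of the detrending approaches mentioned above, differencing and residualizing out a linear trend, can be sketched in a few lines of Python. The function names are hypothetical, the data are made up, and removing autoregressive components is not covered here.

```python
def first_difference(values):
    """Replace each value with its change from the previous time point."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def remove_linear_trend(values):
    """Return residuals from an ordinary least squares line fitted over time."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two datapoints")
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    slope = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    intercept = v_mean - slope * t_mean
    return [v - (intercept + slope * t) for t, v in enumerate(values)]


# Made-up toy series with an upward trend, not ECD data.
toy_series = [3.0, 5.0, 4.0, 8.0]
print(first_difference(toy_series))
print(remove_linear_trend(toy_series))
```

Differencing shortens the series by one datapoint, whereas residualizing keeps its length, which may matter when a variable has close to the minimum of 20 time points.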
The ECD can be used to explore theoretically and methodologically important questions about ecology and human cultural universality and diversity. By analyzing these data on their own or in concert with other sources of data, researchers can explore questions including:
How much of human cultural variation around the globe is explained by ecology?21
How might interactions between different ecological variables and/or their statistical metrics be linked to both specific cultural variables and patterns of cultural variation in general?
Are historical ecological conditions or current ecological conditions more closely related to cultural variation?21
Do different ecological metrics (such as predictability versus current levels) have qualitatively different linkages to cultural diversity?21
How might interactions between different metrics of the same ecological variable be linked to cultural variation?
How might interactions between certain ecological variables (or their statistical metrics) and cultural variables be linked to cultural variation?
Do societies cluster in a similar or different fashion based on different cultural variables?
How might ecological similarity between home and host society predict relative ease or difficulty of acculturation?
Illustrative exploratory analysis
To illustrate one promising way in which these data can be used, we use dendrograms to explore visually how countries cluster based on cultural and ecological variables. Dendrograms are depictions of hierarchical clustering, a statistical method for grouping similar observations within a dataset across multiple hierarchically nested levels. The degree of similarity between two dendrograms can be assessed using Baker’s gamma, a rank correlation coefficient with corrections for non-independence of observations22 that indexes whether countries cluster at the same hierarchical level across two dendrograms. As with a Pearson coefficient, Baker’s gamma values range from −1 (highly dissimilar) to 1 (highly similar). We note that although dendrograms have been used in prior cross-cultural work to explore clustering of countries23,24,25, this approach is primarily used in an exploratory fashion, as is the case in the present work. Thus, the present analyses may be better thought of as a jumping-off point for further investigation rather than as definitive findings.
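To make the comparison of two dendrograms concrete, the sketch below clusters the same four items twice and correlates the ranks of the heights at which each pair of items first joins a cluster. It is an approximation written in Python for illustration and is not the dendextend code used for the analyses reported here. Complete linkage, Euclidean distance, and the use of merge heights rather than cluster counts are assumptions (the two should rank pairs identically when merge heights are distinct), and all function names and data are hypothetical.

```python
from itertools import combinations


def merge_heights(points):
    """Complete-linkage clustering; map each pair of items to its merge height."""
    def dist(i, j):
        return sum((x - y) ** 2 for x, y in zip(points[i], points[j])) ** 0.5

    clusters = [[i] for i in range(len(points))]
    heights = {}
    while len(clusters) > 1:
        # Find the two clusters whose farthest members are closest.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = max(dist(i, j) for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        for i in clusters[a]:
            for j in clusters[b]:
                heights[(min(i, j), max(i, j))] = d
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return heights


def ranks(xs):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and xs[order[end + 1]] == xs[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant


def dendrogram_similarity(points_a, points_b):
    """Spearman correlation of pairwise merge heights across two clusterings."""
    ha, hb = merge_heights(points_a), merge_heights(points_b)
    pairs = sorted(ha)
    return pearson(ranks([ha[p] for p in pairs]), ranks([hb[p] for p in pairs]))


# Made-up scores for the same four items on two different variables.
cultural = [(0.0,), (1.0,), (5.0,), (6.0,)]
ecological = [(0.0,), (5.0,), (1.0,), (6.0,)]
print(dendrogram_similarity(cultural, cultural))
print(dendrogram_similarity(cultural, ecological))
```

In the toy data, items 0 and 1 pair up on the first variable but are split across clusters on the second, so the similarity between the two clusterings is negative.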
We began by building a dendrogram using Schwartz’ cultural values: harmony, embeddedness, hierarchy, mastery, affective autonomy, intellectual autonomy, and egalitarianism26 (Fig. 1). Next, we built a series of dendrograms representing the eleven different metrics (in 2007, the year of publication for the Schwartz’ cultural values data) for all nine ecological variables (Figs. 2–4).
Next, we explored the clustering based on Schwartz’ values and current levels of ecology, as the latter metrics are most commonly used in cross-cultural research. In the dendrogram based on Schwartz’ cultural values, many Western European countries are grouped together (blue cluster) and are distinct from the clusters which contain the Eastern European countries (purple and green), which is broadly consistent with previous research showing regional variation in values within Europe27.
However, there are some clusters in the dendrogram based on Schwartz’ values that suggest cultural similarities which are not based on geographic proximity or other commonly used ways of grouping countries—such as the purple cluster, which contains Romania, Venezuela, Japan, Brazil, and the United States.
Why might this be? Examining the current levels of ecology dendrogram provides some insight into this unintuitive finding. For example, Brazil and the United States cluster closely together on the dendrograms for both Schwartz’ cultural values and their overall current social and physical ecology. Additionally, these dendrograms point to interesting avenues for future research by highlighting cases where ecological and cultural similarities diverge. For example, South Korea and Spain are in distinctly different cultural clusters but appear to have highly similar ecologies. Thus, one avenue for future research might be to investigate which factors moderate links between ecological similarity and cultural similarity.
Comparing the dendrogram built from current levels of ecology with the one built from Schwartz’ cultural values yielded a Baker’s gamma of 0.26, suggesting a moderate, positive relationship between the two. However, not all metrics of ecology show the same relationship. The Baker’s gamma comparing the Schwartz’ dendrogram to the one for MAPE (a metric of ecological unpredictability) was smaller: −0.04 (Fig. 3); for standard deviation it was larger: 0.30 (Fig. 4), suggesting a greater degree of statistical similarity in the clustering of countries based on ecological variability and cultural values. This range of Baker’s gammas suggests that the strength of similarity between ecology and culture can vary considerably based on which metric is used to measure ecology.
All calculations were conducted in R28 (version 4.2.0), using the forecast, psych, foreign, jtools, and lmtest packages16,29,30,31,32. The R code used to aggregate and calculate the statistical metrics of ecology is available on OSF at https://osf.io/r9msf/. Code for the dendrogram example is also available on OSF and uses the circlize and dendextend packages33,34.
References
Tooby, J. & Cosmides, L. The Psychological Foundations of Culture. in The Adapted Mind: Evolutionary Psychology and the Generation of Culture (eds. Barkow, J. & Williams, G.) (1992).
Sng, O., Neuberg, S. L., Varnum, M. E. W. & Kenrick, D. T. The behavioral ecology of cultural psychological variation. Psychol Rev 125, 714–743 (2018).
Steward, J. The Concept and Method of Cultural Ecology. in The environment in anthropology: a reader in ecology, culture, and sustainable living (eds. Haenn, N. & Wilk, R. R.) 12–17 (New York University Press, 2006).
Fincher, C. L., Thornhill, R., Murray, D. R. & Schaller, M. Pathogen prevalence predicts human cross-cultural variability in individualism/collectivism. Proc. R. Soc. B 275, 1279–1285 (2008).
Na, J. et al. Individualism-collectivism during the COVID-19 pandemic: A field study testing the pathogen stress hypothesis of individualism-collectivism in Korea. Pers Indiv Differ 183 (2021).
Sng, O., Neuberg, S. L., Varnum, M. E. W. & Kenrick, D. T. The crowded life is a slow life: Population density and life history strategy. Journal of Personality and Social Psychology 112, 736–754 (2017).
Rotella, A., Varnum, M. E. W., Sng, O. & Grossmann, I. Increasing population densities predict decreasing fertility rates over time: A 174-nation investigation. American Psychologist 76, 933–946 (2021).
Gelfand, M. J. et al. Differences Between Tight and Loose Cultures: A 33-Nation Study. Science 332, 1100–1104 (2011).
Santos, H. C., Varnum, M. E. & Grossmann, I. Global increases in individualism. Psychological Science 28, 1228–1239 (2017).
Jackson, J. C., Gelfand, M., De, S. & Fox, A. The loosening of American culture over 200 years is associated with a creativity–order trade-off. Nature Human Behaviour 3, 244–250 (2019).
Wormley, A. S., Kwon, J. Y., Barlev, M. & Varnum, M. E. W. An EcoCultural Dataset for Investigating Cultural Variation. Open Science Framework https://doi.org/10.17605/OSF.IO/R9MSF (2022).
Kirby, K. R. et al. D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLoS ONE 11, e0158391 (2016).
Taylor, W. P. What is Ecology and What Good is It? Ecology 17, 333–346 (1936).
Friederichs, K. A Definition of Ecology and Some Thoughts About Basic Concepts. Ecology 39, 154–159 (1958).
Davies, N. B., Krebs, J. R. & West, S. A. An introduction to behavioural ecology. (John Wiley & Sons, 2012).
Hyndman, R. et al. forecast: Forecasting Functions for Time Series and Linear Models. (2021).
Koenig, W. D. Spatial autocorrelation of ecological phenomena. Trends in Ecology & Evolution 14, 22–26 (1999).
Dobson, P. & Gelade, G. A. Exploring the Roots of Culture Using Spatial Autocorrelation. Cross-Cultural Research 46, 160–187 (2012).
Dormann, C. F. et al. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30, 609–628 (2007).
Jebb, A. T., Tay, L., Wang, W. & Huang, Q. Time series analysis for psychological research: examining and forecasting change. Front Psychol 6 (2015).
Wormley, A., Kwon, J. Y., Barlev, M. & Varnum, M. E. W. Ecology Explains a Substantial Amount of Human Cultural Variation Around the Globe. Preprint at https://doi.org/10.31234/osf.io/84xjg (2022).
Baker, F. B. Stability of two hierarchical grouping techniques case I: sensitivity to data errors. J Am Stat Assoc 69, 440–445 (1974).
Awad, E. et al. The Moral Machine experiment. Nature 563, 59–64 (2018).
Pick, C. M. et al. Fundamental social motives measured across forty-two cultures in two waves. Sci Data 9, 1–12 (2022).
Obradovich, N. et al. Expanding the measurement of culture with a sample of two billion humans. Journal of the Royal Society Interface 19, 20220085 (2022).
Schwartz, S. H. Value Orientations: Measurement, Antecedents and Consequences Across Nations. in Measuring Attitudes Cross-Nationally 169–203, https://doi.org/10.4135/9781849209458.n9 (SAGE Publications, Ltd, 2007).
Inglehart, R. & Baker, W. E. Modernization, Cultural Change, and the Persistence of Traditional Values. Am Sociol Rev 65, 19–51 (2000).
R Core Team. R: A Language and Environment for Statistical Computing. (2022).
Revelle, W. psych: Procedures for Psychological, Psychometric, and Personality Research. (2022).
R Core Team et al. foreign: Read Data Stored by ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, ‘dBase’,… (2022).
Long, J. A. jtools: Analysis and Presentation of Social Scientific Data. (2022).
Hothorn, T. et al. lmtest: Testing Linear Regression Models. (2022).
Gu, Z., Gu, L., Eils, R., Schlesner, M. & Brors, B. Circlize implements and enhances circular visualization in R. Bioinformatics 30, 2811–2812 (2014).
Galili, T. dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 31, 3718–3720 (2015).
World Bank Group. Climate Change Knowledge Portal. Climate Change Knowledge Portal https://climateknowledgeportal.worldbank.org/download-data# (2020).
The World Bank. GDP per capita (current US$). The World Bank https://data.worldbank.org/indicator/NY.GDP.PCAP.CD (2019).
Solt, F. Measuring Income Inequality Across Countries and Over Time: The Standardized World Income Inequality Database. Social Science Quarterly 101, 1183–1199 (2020).
World Health Organization. WHO Mortality Database. https://www.who.int/data/data-collection-tools/who-mortality-database (2020).
United Nations Population Division. Life expectancy at birth, total (years). The World Bank https://data.worldbank.org/indicator/SP.DYN.LE00.IN (2019).
Global Burden of Disease Collaborative Network. Global Burden of Disease Study 2019 (GBD 2019) Results. Institute for Health Metrics and Evaluation (IHME) http://ghdx.healthdata.org/gbd-results-tool (2020).
International Labour Organization. Unemployment, total (% of total labor force). World Bank https://data.worldbank.org/indicator/SL.UEM.TOTL.ZS (2020).
This work is supported by the National Science Foundation’s Graduate Research Fellowship Program (ASW). Thank you to our research assistants (GA, DG, EC, JG, and SS) for their assistance with data collection.
The authors declare no competing interests.
Cite this article
Wormley, A.S., Kwon, J.Y., Barlev, M. et al. The Ecology-Culture Dataset: A new resource for investigating cultural variation. Sci Data 9, 615 (2022). https://doi.org/10.1038/s41597-022-01738-z