Background & Summary

Climate change and biodiversity loss are inextricably linked, sharing common drivers in human activities, and threatening human wellbeing and the possibility of achieving the Global Goals of Agenda 20301. For example, climate- and biodiversity-related Sustainable Development Goals (SDGs) affect most of the other SDGs and are affected by a number of them – hence understanding synergies and trade-offs between the SDGs becomes crucial in attaining these goals overall2. However global policies such as those addressing biodiversity loss (e.g. the Strategic Plan for Biodiversity 2011–2020), climate change (Paris Agreement 2015), and sustainable development (Agenda 2030) pay insufficient attention to these linkages and trade-offs. There are considerable debates about how best to tackle climate change and biodiversity loss, often without due consideration to human wellbeing. Calls such as the ‘Half-Earth’ – reserving half of the planet for nature conservation3,4,5, or protecting 30% of the planet by 2030 fail to consider the potential costs to millions of people dependent on nature for their livelihoods6,7 or the fact that indigenous and forest-dependent local communities already protect and sustainably manage millions of hectares of forests worldwide often with greater success than traditional protected areas8,9.

Within this context, understanding the broader contribution of uncultivated (wild or less disturbed) ecosystems to the wellbeing of communities that use them directly remains relevant. A vast literature based on small sample studies exists exploring resource dependency10,11,12; natural resource management and governance13,14,15,16,17; gender18,19,20; markets21,22,23 among others. However, systematic pooled data analyses across multiple sites and countries are harder to find. Of the few exceptions, Centre for International Forestry Research (CIFOR)’s NTFP case comparison project focused specifically on non-timber forest products, livelihoods and conservation using a pool of 61 cases across Africa, Asia and Latin America24; Angelsen et al.25 used the resulting database covering 24 tropical countries for a global comparative analysis of environmental income and rural livelihoods; and Jagger et al.26 used the same database alongside RRI’s global forest tenure data to analyse tenure and forest income globally.

The Ecosystem Services for Poverty Alleviation (ESPA) Programme provided a common framework for several ambitious interdisciplinary projects to investigate the relationship between ecosystems and multidimensional wellbeing and/or poverty in countries throughout the global south between 2013 and 201927. Building on these studies and supplementing them with comparable datasets from other research programmes, we have drawn on 11 datasets comprising 232 settlements and 10,971 households in ten low- and middle-income countries (Table 1), representing forest, savanna and coastal ecosystems, to analyse (i) how uncultivated nature contributes to multi-dimensional wellbeing, (ii) how benefits are distributed between households, and (iii) how governance (in the form of tenure regimes) mediates the relationship between uncultivated nature and wellbeing. These 11 datasets were selected for two key reasons: (1) they all collected similar data on ecosystem services and multidimensional human wellbeing using comparable surveys; (2) they all collected data within the same time period (2011–2015) making it easier to produce comparable estimates of wellbeing indicators.

Table 1 Overview of the original data included in the dataset.

This dataset was compiled for analyses in the research project ‘Nature’s contribution to poverty alleviation, human wellbeing and the SDGs’ (Nature4SDGs) (NERC Grant NE/S012850/1) and is now publicly archived with ReShare (UK Data Service)28. The dataset integrates secondary data on rural livelihoods, multi-dimensional human wellbeing, household demographics, resource tenure, and social-ecological context, thus focusing particularly on SDGs 1 (no poverty), 2 (zero hunger), 10 (reduced inequalities) and 15 (life on land). It primarily draws upon nine existing household surveys, and their associated site descriptions and qualitative interviews which informed and contextualised the surveys. It also draws upon existing global datasets on travel time to cities, population density, local area GDP, land use and land cover. Using this dataset, the Nature4SDGs project is specifically examining multidimensional wellbeing from the use of uncultivated nature; the role of common pool uncultivated resources in reducing income inequalities; and the consumption of wild protein across different social-ecological contexts. This dataset will potentially allow further analyses on nature-wellbeing relationships in future on its own or in combination with similar datasets.

Methods

The integrated dataset was developed in four main steps: (1) data identification; (2) collection of complementary data; (3) data processing; and (4) organization and publication of the dataset (Fig. 1).

Fig. 1
figure 1

Summary of the steps and methods of creating the dataset.

Data identification

The original household surveys were conducted during nine different research projects (see Table 1 and Figs. 1, 2); seven of those funded by the Ecosystem Services for Poverty Alleviation (ESPA) programme (https://www.espa.ac.uk), which ran from 2009 to 201829. Each project had a different research focus, but all collected similar data on ecosystem services, livelihoods, and human wellbeing. All surveys collected information on livelihoods for a period between 2011 to 2015 and generated a variety of qualitative information from site descriptions and interviews. Owing to different sampling strategies in original surveys, while the sample of households is representative of each settlement, special attention needs to be paid when thinking about the population of settlements that this integrated dataset (or subsets of it) may represent. Full details of sampling strategies are available in the documentation of the original datasets30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56.

Fig. 2
figure 2

Locations of original sites of data collection included in this dataset.

Collection of complementary data

To supplement the household surveys, we also collated several global datasets to generate information on the social-ecological context of each settlement as close to the time period covering original data collection as possible (i.e. 2011–2015), including the travel time to cities57, population density58, estimated local area GDP in USD59, and land use land cover information60,61,62. This information was generated using data on the spatial location of the settlements, which is confidential and not included in this integrated dataset.

Data processing

The dataset consists of four main components: Settlement-level resource tenure systems, settlement-level social-ecological context, household demographics and wellbeing, and household livelihood sources (Fig. 3). Below follows a description of the methods used to generate each component.

Fig. 3
figure 3

Summary of dataset structure and sources. Each column represents one of the archived csv files, and all variables are described fully in archived data documentation files28.

Settlement-level resource tenure systems

Our approach to identifying and mapping resource tenure prevalent at the site level followed existing frameworks on social-ecological systems63,64, tenure, and property rights65,66,67, together with the site/project-specific information on the types of resources available to the households and the de facto and de jure rights to those resources. We contextualised the settlement-level resource systems around the ‘service shed(s)’ idea68 and mapped key categories of rights to the identified resource types, focusing specifically on accurately mapping rights in practice (de facto) at all settlements based on the available qualitative data (participatory maps, land use change, resource trends) and household survey information (land holdings, access to resources).

This component of the dataset includes the variables resource sector, resource type, and de facto tenure. Sector is a higher-level classification of resources, similar to ‘RS1 Sector’ in Ostrom63, and includes categories such as agriculture, forest (natural), silviculture, fisheries and built environment. Each sector can include multiple resource types, which relate to primary use or purpose or some defining characteristics of the resource, including limitations. For example, farmland, agroforestry plots and home-gardens all fall under the ‘agriculture’ sector. Each resource type could be characterised as either ‘cultivated’ or ‘uncultivated’. The term tenure encompasses not just property rights but also wider institutions (such as who makes constitutional rules, who makes operational rules) within which resource use is embedded. While defining higher-level tenure types of resources, we limited ourselves to property rights. All common-pool resources comprise a class of goods that shares two attributes of importance for economic activities: (1) it is costly to exclude individuals from using the good either through physical barriers or legal instruments, and (2) benefits consumed by one individual subtract from the benefits available to others69. Schlager and Ostrom65 identify five property rights that are most relevant for use of common-pool resources: “access, withdrawal, management, exclusion, and alienation”65. For our resource tenure mapping exercise, particularly for uncultivated common pool resources, we are only looking at de facto access and withdrawal rights from those resource types. At the higher level, we have defined four resource tenure categories based on de facto rights to the resource type defined earlier. Higher-level tenure classification broadly corresponds to: (1) Privately owned resources (primarily cultivated); (2) Community-managed resources (sub-categorised into regulated or unregulated at the community level); (3) Protected area (sub-categorised into strictly protected, regulated access or de facto open access); and (4) Open access (unregulated public access).

Settlement-level social-ecological context

The variables included in this component of the dataset aim to describe the broader social and ecological context at the settlement level. The social variables are focused on the degree of market access in the village and population density. These variables were generated either directly from the surveys, or from existing spatiotemporal global datasets on travel time to cities57, population density58 and estimated local area GDP in USD59. The ecological variables are focused on characterising the land use land cover60,61,62 around each settlement, the associated de facto tenure regimes, as well as the relative extent of ‘wild’ or ‘uncultivated’ areas. All ecological variables were generated from existing spatiotemporal datasets.

We generated the variables for each settlement according to the following steps. Following established travel time-based methods for estimating elements of social-ecological landscape structure70, we first used information from local site descriptions and interviews to estimate the typical distance of one day’s return travel to a village in each site, based on common transport options and local geography. We then used this to calculate the diameter of a circular ‘landscape buffer’ within which to summarise settlement-level variables (varying from 3 km to 10 km in diameter for each village; see the methods summary accompanying the dataset for details). Next, we used land use land cover products and expert knowledge to generate estimates of the extent of different resource tenure systems within this buffer (see previous subsection for details on resource tenure system classifications), including the extent of uncultivated land covers. For example, within the landscape buffer around a village, the relative extent of different tenure classes within a landscape buffer could comprise 70% private land, 10% community regulated land and 20% protected area land with regulated access. In this example, if the community regulated and protected area were also uncultivated, the total uncultivated land within the buffer would be 30%. While the use of circular buffers are only approximate representations of actual resource catchment boundaries, our assumption is that the landscape metrics generated within these boundaries are correlated to some degree with characteristics of the underlying resource catchments.

Household demographics and wellbeing

This component of the dataset was generated in two parts; one on household demographics and one on multidimensional human wellbeing. For the part on household demographics, we used the household surveys to generate household-level variables on age, labour profile, socio-cultural grouping, and land ownership. Variables on age (dependency) and labour profile will affect which livelihood strategies a household can engage in, and the (per capita) benefit from these livelihoods within the household. Social capital and grouping variables are indicators of social difference that may indicate something about the way that they can interact with resource governance and other institutions. Variables on land ownership and tenure type provide household-level information on tenure, to complement the settlement-level ‘Resource Tenure System’ dataset. All variables were generated by importing each of the original household surveys into R71, then pulling relevant variables together into a common structure (mainly using the dplyr72 package). However, not all variables were available across all sites because they were not collected in some of the original datasets. The technical validation section describes how we dealt with missingness and data equivalence.

For the second part on household multidimensional wellbeing, we used the household surveys to generate a series of (mainly) binary household-level variables on different dimensions of human wellbeing. In doing so we sought to balance the need for detailed and locally contextualised measures with the need for cross-site comparability. Broadly, we defined wellbeing as having three dimensions (material, subjective and relational)73, and framed our measurement approaches based on the associated environment and development literature50, including subjective wellbeing74, human needs75, wellbeing in developing countries76, capabilities77, and relational wellbeing78,79. Definitions of these concepts vary80. As a starting point, Coulthard et al.81 provide a useful summary of HWB as comprising: “[…] a material dimension that emphasizes the objective resources a person has access to; a relational dimension that considers how social relationships influence what people can (or cannot) do; and a subjective dimension that takes into account a person’s level of satisfaction with the quality of life they achieve” (p. 299). Within this framing ‘basic needs’ approaches aim to understand if people are deprived in different subdimensions of material, subjective and relational wellbeing (e.g. health, education, shelter, life satisfaction, social relations etc.)50.

To develop standardised measures for these different subdimensions, we adapted the methods used to generate the Oxford Poverty & Human Development Initiative’s (OPHI) multidimensional poverty indicator (MPI)82. The MPI approach is grounded in Amartya Sen’s capabilities approach77 and is based on the ‘counting’ of different basic needs (or deprivations) that are met (or unmet) within a household83,84. It is already widely used to combine diverse data on wellbeing from different surveys, sites and countries33,85,86. Typically the MPI is generated in two steps. First, for each indicator of a basic need (e.g. years of schooling) a cut-off is applied (e.g. <6 years), below which a household is deprived for that dimension (i.e. it transforms each observed indicator into a deprived/not deprived binary variable). This cutoff value can differ between datasets depending on locally contextualised cutoffs and serves as a method of cross-dataset standardisation. In the second step, these binary indicators are added together to form a (weighted) multi-level ordinal index of relative aggregate deprivation across all households. A further cutoff is then applied to this variable to determine if a household is ‘poor’ (e.g. at least 1/3 of basic needs not met).

In generating this dataset, we used the MPI approach as a starting point and made two adaptations. First, in order to maintain the richness of our wellbeing data, we only implemented the first step to provide a multivariate set of binary variables of deprivation in different basic needs. These could be combined into a weighted ordinal variable or a binary MPI if needed. Second, in addition to the material basic needs usually measured in the MPI, we added further material dimensions (protein consumption, productive assets), as well as subjective (life satisfaction) and relational dimensions (institutions, autonomy)74,87. Most variables from the original surveys were transformed into binary indicators according to common thresholds in the latest MPI88 and literature associated with the original datasets33,50,89,90. The only exception is the life satisfaction variable, which was sufficiently similar between datasets that we could transform it into a four-level ordinal variable, and in doing so preserve more information on this dimension. All variables were generated in R using the dplyr package. A lower score means lower wellbeing for that dimension. Again, not all variables are available across all sites because they were not collected in some of the original datasets. We did not include a poverty line or wealth rank. These can be generated from the income information in the ‘Household Livelihoods’ dataset, and/or from the assets information in the original surveys.

Household livelihood sources

Broadly, the concept of livelihood includes economic as well as non-economic attributes of making a living. Apart from income, it covers the “social relationships and institutions that mediate people’s access to different assets and income streams” (p. 291)91. The second component of the dataset focuses on the economic aspect of livelihoods, i.e. income of households. To estimate household income, we prepared a dataset containing two types of livelihoods: harvests from cultivated (e.g. farms, aquaculture) and uncultivated (e.g. forests, fisheries) sources; and cash income from employment, businesses etc. For harvests, we generated information on the annual quantity and value of harvest of each product collected by each household and shares of harvests used for subsistence and sale. Cash income includes each household’s earnings from non-farm businesses and other sources, wage income and remittances. To derive the value of the harvest, we used the market price of each product, including for harvests that were not traded (e.g. subsistence harvests). To tackle the problem of missing price, we assigned the median price of a product calculated at the village level. Aggregation of harvest values and other earnings give the total household gross income. We then converted the gross household income into USD using World Bank indices of Purchasing Power Parity (PPP) in the year of the survey92.

Complete livelihoods were available for all original datasets except for PIMA (Tanzania) and PEFESPA (India). The PIMA survey gives the share of the harvest, instead of absolute quantities. Therefore, we are not able to estimate the gross value of harvest, only the proportion. The livelihood data of PEFESPA is incomplete in the sense that it has data on income from forest ecosystems but does not have data on income from the non-farm sector, wage income and remittances.

Organisation and publication of the dataset

The integrated dataset is comprised of four components: (1) Household demographics and wellbeing, where each row is a household; (2) Household livelihood sources, where each row is a livelihood source associated with a particular household; (3) Settlement-level resource tenure systems, where each row is a natural resource associated with a settlement; (4) Settlement-level social-ecological context, where each row is a settlement. The four components can be linked together through corresponding ID variables (in a relational database structure). Figure 3 summarises the variables, ID links, and source data of each of the components. Some variables are not present across all sites. The implications of this missingness are discussed below in Technical Validation Section.

Ethical approval

The integrated dataset described in this paper was produced under the Nature4SDGs project with full approval from the King’s College London’s Research Ethics Office on the ‘Use of Secondary Data in Nature4SDGs Project’ (KCL Ethics Ref: LRS-19/20-14886). As a general principle, all household level identifying information were already removed from the original datasets, as well as other village-level identifying information in some cases. All spatial data are kept confidential as per the original research ethics approval in the contributing datasets.

Data Records

The integrated dataset is available through the ReShare repository28 and is comprised of four components, compiled as four separate .csv files: (1) Household demographics and wellbeing (n4s_hh.csv), where each row is a household; (2) Household livelihood sources (n4s_lvl.csv), where each row is a livelihood source associated with a particular household; (3) Settlement-level resource tenure systems (n4s_rts.csv), where each row is a natural resource associated with a settlement; and (4) Settlement-level social-ecological context (n4s_setts.csv), where each row is a settlement.

The four components can be linked together through corresponding ID variables (in a relational database structure). We also provide a further three csv files to help navigate and use the dataset: (1) n4s_ids.csv: to help link between csv files, a file with all corresponding IDs, including the IDs from the original source datasets (should you want to link back to the original source databases); (2) n4s_variable_names.csv: a csv file with all variable names and descriptions; and (3) n4s_ls.csv: which contains more detailed information on the extent, proportion and area of the different land covers and associated resource tenure systems in each settlement.

Technical Validation

Missingness, imputation, and data equivalence

While the original surveys shared a focus on ecosystem services and wellbeing, some of the variables differed slightly between the surveys. Additionally, each original survey has its own strengths and weaknesses in data quality (e.g. some have very robust livelihoods data, while others do not). In using this combined dataset, special consideration is therefore needed of missingness, as well as data quality and equivalence, between original datasets.

For data missingness, there are two types: (1) ‘Real’ missingness where, while a variable may have been collected in the original survey, there are occasionally a limited number of household values missing. This is due to more traditional issues around non-responses or enumerator errors in the original survey; and (2) ‘Question absence’ missingness, where the variable was not collected for an entire village or region in the original surveys.

Real missingness is coded as −9 in the dataset, while question absence is coded as −8.

In the published dataset, we have avoided imputing missing data for all variables except for harvest value (h_val) in the livelihoods data frame (n4s_lvl). For all other variables, missingness can be imputed from the raw data provided if users so wish. For harvest value, robust imputation requires site-specific knowledge on similarities between villages and harvest unit equivalence. We thus undertook to impute missing h_val values for the combined dataset.

Missingness of the harvest value data primarily occurred because respondents could not provide values for every harvest in every household (e.g. because some households do not trade every type of harvest). We thus used a hierarchical strategy to impute best estimates of harvest value where it was missing. In order of preference: (1) Where other households in the village had reported a harvest value for the same livelihood source (lvl_source) and unit (h_unit), we took the median of the harvest in the village; (2) Where the above was not possible, we used field notes on h_unit equivalences to impute the within-village median value of the same lvl_source with different units; (3) Where the above was not possible, we extended imputation to include median values from nearby villages that according to site experts were qualitatively similar in their socio-ecological context; and (4) Where the above was not possible, we used field notes and expert opinion from local site experts to estimate harvest values.

While we have sought to generate common variables across the original surveys, the varying origin of these variables means that data quality and equivalence need to be carefully considered when designing an analysis. In Section 4 of the data documentation28 we highlight particular issues with data quality and equivalence for each part of the integrated dataset. Generally, in any one analysis particular sites can be assumed to have more robust variables for particular constructs. Analyses can deal with this by either focusing only on sites with high quality variables, or by running multiple models (e.g. seeing if there are differences between one model with high quality sites only, and a second with all sites).

Expert validation

During and after the generation of the integrated dataset, we validated variables for each site through an iterative process of data checking with one or more experts for each site who were involved in the original data collection. This included: generating and reviewing summary statistics of variables; reviewing apparent outliers and other anomalies; RTS and LULC map validation interviews, where we interactively viewed land cover maps and RTS for each settlement, reflected on their accuracy, and collaboratively linked RTS to specific land cover classes. This validation process was also facilitated through the testing of the dataset in four forthcoming analyses, during which remaining anomalies in the dataset were identified and checked. The validation process led to the exclusion of particular variables in a few settlements due to data quality issues in the original datasets, and to the refinement of RTS, livelihood and wellbeing classifications applied during our generation of the Nature4SDGs dataset.

Usage Notes

We have done our utmost to ensure a robust standardised dataset. However, given the diversity and complexity of the original surveys, users need to be critical about data equivalence between sites and underlying biases in the sample of settlements. We thus strongly encourage users to consult the original documentation for each original dataset (referenced above) before using and drawing conclusions from the data for particular sites. We would also encourage users to consult the authors of this data paper on the use of data, who have a wealth of expert knowledge about the original datasets and specific knowledge of the sites. We highlight some specific usage notes below:

Household demographics and wellbeing

  • Some of the common variables on household demographics (e.g. health, education, assets, social capital) appear in a standardised binary form in the human wellbeing (HWB) part of the dataset. If needed, unstandardised ordinal, interval and/or continuous variables are available in the original datasets.

  • Household-level tenure variables are focused on private land resources. Variables on common property terrestrial, aquatic and marine resources are at the settlement level and can be found in the settlement-level components of this dataset (‘Resource Tenure Systems’ and ‘Settlement-level SES context’). Private ownership in fisheries tends to be related to the ownership of a fishing vessel. This information is integrated into the ‘productive assets’ variable in the HWB dataset.

  • We have included ethnicity and religion information as nominal variables. These variables can either be used as is, or can be further interpreted to group households in some more meaningful way for a particular analysis.

  • Other useful information on household occupations (e.g. the presence of ‘elite’ occupations; employment in different sectors) and wealth ranks can be derived from the ‘Household Livelihoods’ data frame.

  • We have not included variables on government aid, credit/savings (source and cost) or debt because this information was not widely available throughout the original surveys and the variables differ significantly where it was present. This information can be retrieved from the original datasets cited above if needed.

Livelihoods

From the product level data set, we get the gross value of harvests i.e. the monetary value of harvest without deducting the cost of labour and other input costs. The reason for considering the gross value of harvest is that most of the original datasets do not have robust data on cost.