Introduction

The first of the United Nations sustainable development goals (SDGs) is poverty eradication (United Nations General Assembly, 2015), and achievement of this goal depends on regular and reliable estimates of the number of people in poverty and where they live. This information can be difficult to obtain yet is critically important to development agencies, foundations, NGOs, and governments working toward alleviating poverty within low- and middle-income countries (LMICs). Low socioeconomic status within a country is associated with significant health problems; for example, malaria, child mortality, and population growth have all been linked to poverty (Tusting et al., 2013; Målqvist, 2015; UNFPA, 2014). The geographic identification of poor populations and of those at high risk of poverty is of paramount importance when developing measures to target the vulnerable. Detailed poverty maps that quantify the spatial distribution and magnitude of economic impoverishment are essential for progress toward poverty eradication, and the availability of high-quality, timely, and disaggregated data is necessary for evidence-based decision making for implementing the 2030 agenda (United Nations, 2017).

Subnational estimates of poverty are typically made using data from national censuses and household surveys within a small area estimation (SAE) modeling framework. This method utilizes the content detail of household surveys along with the coverage of the census to produce subnational estimates of the proportion of households living in poverty (Elbers et al., 2002a, 2002b; Hentschel et al., 1998). As censuses are a main input of SAE models, detailed and reliable estimates of poverty at high granularity would be possible with a timely and complete census for each country. However, the release of census data, and accompanying data on subnational unit boundaries, typically occurs every 10 years. Censuses can be delayed, missing, incomplete, unreliable, or unavailable for many low- and middle-income countries (LMICs), making the estimation of development indicators especially difficult in the highest burden areas. Furthermore, in some LMICs the administrative boundaries and thus spatial availability of census data can be too coarse to produce reliable subnational estimates, or accurate data on admin boundaries are altogether not available (Jerven, 2013). These factors have led researchers to explore new sources of data and methodologies for estimating socioeconomic status that are independent of data from censuses to meet the need for more frequent updates and finer spatial detail in estimating poverty.

One such source of data comes from features derived from call detail records (CDRs) collected by mobile network operators (MNOs). CDRs contain the metadata, but not the content, of communications between millions of people at a time. These data also often include information on user location and social ties, which combined can give insight into population-level movement and social networks (Blondel et al., 2015). Data on the amount and frequency of airtime recharges, i.e. top-ups, on mobile phones provide direct information on user consumption, which can be an important predictor of poverty measures (Steele et al., 2017). In addition, compared to the information collected via surveys, CDRs are largely considered free from the bias of self-reporting. Whereas information from observed behavior has been shown to differ from self-reported behavior due to the perception of subjects themselves (Eagle et al., 2009), CDRs are derived only from observed behavior, e.g. mobility records, calling patterns, top-up amounts and frequencies.

CDR data come with limitations and biases and it is important to consider these, as a dataset from one MNO in a country will not necessarily reflect the general population. In many cases the ability to quantify self-selection bias in the mobile user population as compared to the general population is impracticable. However, there are some known factors that contribute to data bias and limitations when using CDRs as a proxy to study the general population. Studies have shown that mobile phone ownership, and thus data generation, is skewed toward the educated, males, urban populations, and wealthier people (Stork, 2011; Wesolowski et al., 2012). Furthermore, multiple telecom companies usually operate within a country and CDRs from a single operator represent only the portion of the population that comprises their market share. Individual access to a mobile phone is also dependent on mobile reception, the ability to afford a mobile handset and top-ups, and electricity to recharge the device (Stork, 2011). These factors vary geographically, and mobile phone ownership and coverage can be low in rural and remote areas. Moreover, not all CDR datasets comprise a year’s worth of data, adding additional bias due to seasonal activities and mobility: studies have shown that time of year directly affects population densities and their consequent characteristics spatially, which influences and is reflected by CDR data (Wesolowski et al., 2017; zu Erbach-Schoenberg et al., 2016). All of these factors contribute to self-selection bias, where the poorest members of the population, and especially those with intersecting forms of social marginalization, may not have access to mobile phones, may not be included temporally due to seasonal activities, and are thus absent from the data.

Despite these drawbacks and biases, CDRs have been shown to provide useful information on the spatio-temporal variation in poverty and wealth. Studies using CDR data to infer socioeconomic status have formed a significant branch of work within the data for development domain, with digital technologies being used to design, implement, and monitor development projects (Data for Development, 2017; GMSA, 2016; GSMA, 2014; OPAL, 2017). Researchers are increasingly making use of the vast information present in CDRs to quantify socioeconomic status in individuals and populations at high spatial and temporal resolution (Blumenstock et al., 2015, 2010; Eagle et al., 2010; Frias-Martinez and Virseda, 2012; Njuguna and McSharry, 2017; Pokhriyal and Jacques, 2017; Smith-Clarke et al., 2014; Soto et al., 2011; Steele et al., 2017). However, to date, this has only been shown for individual countries. A more widespread adoption of mobile phone data as a component of poverty estimation will rely on several factors, including both the internal validity of context-specific studies, and the generalizability of these data across multiple settings. Furthermore, in order for MNOs to make their data available for social good, multiple criteria need to be met:

  • Data privacy protection must be ensured, and legal guidelines followed,

  • Methods used for calculating and aggregating metrics must be transparent and verifiable, and

  • The additional burden on the MNO’s resources on top of their core business must be minimal.

Organizations are working to coordinate these efforts and enable the inclusion of CDR-derived features in a shareable way to harness these data for development (e.g. OPAL (OPAL, 2017), United Nations Global Pulse (United Nations Global Pulse, 2018), Global Partnership for Sustainable Development Data (Global Partnership for SDGs, 2018), World Bank (World Bank, 2016), Data-Pop Alliance (Data-Pop Alliance, 2018), etc.). Data access is one of the most challenging aspects of utilizing CDRs for development goals, and the availability of relevant user features (basic and advanced phone usage, handset type, revenue data, mobility and social network information, and top-ups) can vary greatly from country to country.

Here, we use a common set of CDR-derived features from Namibia, Nepal, and Bangladesh to comparatively estimate poverty within a robust modeling framework across three very different geographies in Africa and Asia. We first produced national-scale poverty maps using aggregate user features derived from all available mobile phone metadata within each individual country. We then produced national-scale poverty maps for each country using only aggregate user features that were available in all three countries, as we are interested in how a generalized set of easily replicable CDR features performs across settings in estimating asset-based poverty. Model performance was evaluated using out-of-sample cross-validation statistics (the coefficient of determination (r²) and the root-mean-square error (RMSE)) calculated on a randomly selected test subset of data. We report on the CDR covariates, both generalizable and country-specific, that are significant predictors of poverty within and across all three countries. We further calculated the numbers of people living in poverty as predicted by each model in order to compare the spatial distributions of socioeconomic status. This allowed us to highlight ancillary data within a local context as being a key consideration when making use of big data for strategic development efforts and monitoring of SDGs.

Methods

All data used in this study were processed to ensure that projections, resolutions, and extents matched. We estimated the approximate reception area of each mobile phone tower via Voronoi tessellation (Okabe et al., 2009) and based the spatial scale of analysis on these coverage areas. Each Voronoi polygon was then assigned aggregate features based on the mean, sum, or mode of the corresponding CDR data. The household survey data were matched to the Voronoi polygons based on the lat/long coordinate representing the centroid of each DHS cluster; where multiple clusters fell within the same polygon, we used the mean aggregate value.
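The matching step above can be illustrated with a minimal sketch. It relies on the fact that a point lies in a tower's Voronoi cell exactly when that tower is its nearest neighbour, so no polygon geometry is needed for the lookup. The tower identifiers, coordinates, and wealth values below are hypothetical toy data, and planar (projected) coordinates are assumed; this is not the processing code used in the study.

```python
import math
from collections import defaultdict

def nearest_tower(point, towers):
    """Return the id of the tower closest to `point`.

    Being closest to a tower is equivalent to lying inside that tower's
    Voronoi cell. Coordinates are treated as planar (projected) x/y values.
    """
    return min(towers, key=lambda tid: math.dist(point, towers[tid]))

def mean_wealth_per_cell(clusters, towers):
    """Average the wealth index of all survey clusters falling in each cell."""
    by_cell = defaultdict(list)
    for centroid, wealth in clusters:
        by_cell[nearest_tower(centroid, towers)].append(wealth)
    return {tid: sum(values) / len(values) for tid, values in by_cell.items()}

# Hypothetical toy data: three towers and four survey cluster centroids.
towers = {"T1": (0.0, 0.0), "T2": (10.0, 0.0), "T3": (0.0, 10.0)}
clusters = [
    ((1.0, 1.0), -0.5),  # falls in the cell of T1
    ((2.0, 0.0), 0.5),   # also T1, so the two values are averaged
    ((9.0, 1.0), 1.2),   # T2
    ((1.0, 8.0), -1.0),  # T3
]
cell_wealth = mean_wealth_per_cell(clusters, towers)
```

For real tower locations in geographic coordinates, a great-circle distance or a projected coordinate system would be more appropriate than the planar distance used here.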

Mobile phone data

The CDR metrics used in this study were derived from network data provided by MTC Namibia, NCell in Nepal, and Grameenphone in Bangladesh. These countries were chosen based on data availability, where agreements with MNOs overlapped household survey data from the DHS. Table 1 provides the network details of the MNO data, including market share, penetration rate, subscribers, and number of Voronoi polygons used for this study. The primary purpose for the collection of CDRs by an MNO is to enable subscriber billing. The re-purposing of this data source for alternative use, as presented here, does not come without additional technical overheads and legal and regulatory challenges. Data availability at any given MNO is driven by numerous factors including, but not limited to, data warehouse resource availability, data retention policies, and the prioritization of operational concerns (see Supplementary Information, Notes on processing, data biases, and limitations of CDR data, for more information).

Table 1 Summary table with the mobile phone penetration rate in each country, market share of each MNO, number of subscribers in each dataset, time period covered by the data, and number of towers.

All CDR indicators were calculated on an individual level and then aggregated up to the tower level based on the Voronoi cell representing the individual’s home location (as defined below and in Steele et al., 2017, SI Section A2). For example, Fig. 1 shows the outgoing call count for each Voronoi cell in the study countries. First, we calculated the outgoing call count for each user in the data; then the data were aggregated within each Voronoi polygon by calculating the mean outgoing call count for all users with coincident home cells. This process is repeated for each covariate and only the aggregate features are used in covariate selection, model fitting, and prediction. To preserve user anonymity, the operators remove all personally identifying information from the data before analysis:

  (i) All customers are de-identified, and only telecom employees have had access to any detailed data.

  (ii) The processing of detailed CDR/top-up data resulted in aggregations of the data at tower-level granularity; the tower-level aggregation makes re-identification impossible.

Fig. 1: Mean outgoing call count per cell tower for Namibia (A), Nepal (B), and Bangladesh (C). The coverage areas for the cell towers are approximated by Voronoi tessellation.

Hence, the resulting aggregated dataset is anonymized and involves no personal data.
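The two-stage aggregation described for the outgoing call count (per user first, then the mean over users sharing a home cell) could be sketched as follows. The user ids, tower ids, and call log are hypothetical, and counting users with no outgoing calls as zero is an assumption made here for illustration rather than a documented choice of the study.

```python
from collections import Counter, defaultdict

def mean_outgoing_calls_per_tower(callers, home_tower):
    """Aggregate per-user outgoing call counts to the home-tower level.

    callers:    iterable with one caller id per outgoing call record
    home_tower: dict mapping each user id to the id of their home tower
    """
    # Stage 1: individual-level indicator (outgoing calls per user).
    calls_per_user = Counter(callers)
    # Stage 2: group users by home tower and take the mean of the indicator.
    per_tower = defaultdict(list)
    for user, tower in home_tower.items():
        # Assumption: users without outgoing calls contribute a count of 0.
        per_tower[tower].append(calls_per_user.get(user, 0))
    return {tower: sum(c) / len(c) for tower, c in per_tower.items()}

# Hypothetical toy data: one entry per outgoing call, keyed by caller id.
callers = ["u1", "u1", "u2", "u3", "u3", "u3", "u3"]
home_tower = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}
tower_means = mean_outgoing_calls_per_tower(callers, home_tower)
```

Only the tower-level dictionary would leave this step, which mirrors the point that individual records are not part of the modeled dataset.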

Namibia

CDR features were derived from the network data of MTC, the leading mobile phone provider in Namibia. The data set spans 12 months between 1 January and 31 December 2013 and contains 2,936,046 users. We calculated covariates for each individual and then grouped all individuals with the same home location based on their last call of the day. We conducted a sensitivity analysis on home location definition, as we are interested in capturing users at their home and not a workplace or other regularly visited location. We analyzed the following alternatives for defining user home locations: nighttime location (most used tower between 8 p.m. and 6 a.m. inclusive); most used tower irrespective of time; and location of the last call of the day. We eliminated the nighttime definition as ~130,000 users had no nighttime location and would have been omitted from our analyses. In comparing the other two definitions, we mapped users and ultimately chose last call of the day as it placed people in residential areas better than most frequently used tower. That is, the most used tower definition placed more users in central urban commercial areas than are reasonably expected to reside there. The last call of the day definition more consistently placed users where they are likely to reside in areas of non-commercial land use. We used raw data (individual call/text entries) for most of the covariates and unfilled daily locations to calculate the home location for each individual. Table 2 details the data processed for Namibia.
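To make the difference between two of the home-location definitions concrete, the sketch below compares "most used tower" with "last call of the day" for a single hypothetical user. How the daily last-call towers are combined into one home location is not stated in the text; taking the most frequent one is an assumption, as are the tower names and timestamps.

```python
from collections import Counter

def home_by_most_used(events):
    """Home tower = the tower with the most events, irrespective of time."""
    return Counter(tower for _, _, tower in events).most_common(1)[0][0]

def home_by_last_call(events):
    """Home tower = the most frequent tower of the last event of each day.

    events: list of (date, time, tower) tuples with ISO-style strings, so
    that plain string comparison orders the times correctly.
    """
    last_per_day = {}
    for day, time, tower in events:
        if day not in last_per_day or time > last_per_day[day][0]:
            last_per_day[day] = (time, tower)
    # Assumption: combine days by taking the modal last-call tower.
    return Counter(t for _, t in last_per_day.values()).most_common(1)[0][0]

# Hypothetical user who calls from a commercial district (CBD) during the
# day and from a residential area (RES) in the evening.
events = [
    ("2013-03-01", "09:00", "CBD"), ("2013-03-01", "12:00", "CBD"),
    ("2013-03-01", "21:00", "RES"),
    ("2013-03-02", "10:00", "CBD"), ("2013-03-02", "13:00", "CBD"),
    ("2013-03-02", "22:30", "RES"),
    ("2013-03-03", "11:00", "CBD"), ("2013-03-03", "20:00", "RES"),
]
most_used = home_by_most_used(events)
last_call = home_by_last_call(events)
```

In this toy case the most-used definition places the user in the commercial district while the last-call definition places them in the residential area, which is the pattern the sensitivity analysis describes.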

Table 2 CDR variables and their description in this study for each country.

Nepal

CDR features were derived from the network data of NCell mobile phone metadata collected between 1 January and 7 April 2015. These data were processed into features of user mobility, social networks, basic phone usage, and selected phone features (Table 2). Again, each user was assigned a home location based on their last call of the day, and the mean value of each indicator was calculated for all users sharing the same home tower and used in model fitting and prediction.

Bangladesh

CDR features were derived from the network data of Grameenphone (GP) mobile phone metadata collected over 4 months between November 2013 and March 2014. GP, the largest mobile network operator in Bangladesh, had 48 million customers at the time of the analysis. Table 2 details the data processed for Bangladesh used in this study. Each user was assigned a home location based on their most used tower, and the mean value of each indicator was calculated for all users sharing the same home tower. Further details are reported in the Supplemental Information, Section A.2 and Tables S1 and S2A of Steele et al. (2017).

Geolocated survey data

We utilized demographic and health surveys data from USAID (Rutstein and Rojas, 2006). These surveys are designed to collect household data on marriage, fertility, family planning, and other health indicators in nearly all lower income countries (Rutstein and Rojas, 2006). By assembling characteristics on living standards correlated with a household’s economic status (i.e. the ownership of a television, telephone, radio; descriptions of floor type, ceiling materials, other facilities), the DHS program calculates a wealth index for each country (Rutstein and Johnson, 2004). In the DHS, it is inferred that a household’s assets and access to amenities are related to its relative economic position in the country (Rutstein, 2008).

We used the per-cluster mean wealth index calculated from the 2013 Namibia DHS, the 2011 Nepal DHS, and the 2011 Bangladesh DHS. These nationally representative surveys are based on two-stage stratified sampling of households, where enumeration areas (EAs or clusters) are first selected with probability proportional to the EA size (see Supplementary Information, Computing the DHS Wealth Index, for more information). The first stage provides a listing of households for the second stage, where a sample is selected per cluster to create statistically reliable estimates of key demographic and health variables (ICF International, 2012; National Institute of Population Research and Training et al., 2013). In Namibia, 550 clusters were first selected with probability proportional to the EA size (267 clusters in urban areas and 283 in rural areas); in Nepal, 289 clusters (95 clusters in urban areas and 194 in rural areas); and in Bangladesh, 600 clusters (207 in urban areas and 393 in rural areas). Geolocations representing the center of each sampling unit were collected in the field, enabling the use of robust statistical methods (accounting for smaller sample sizes and uncertainties in the data) to move from national estimates of poverty to the subnational estimates necessitated by the SDGs.

Calculating people in poverty

We calculated numbers of people in poverty using WorldPop population data (Worldpop Research Group, 2017) from the most recent census year for each country (2011 for all three countries). Here the underlying assumption is each individual person takes the poverty status from his or her household as measured by the DHS wealth index. Population data were then overlaid with model outputs in ArcGIS and the sum of people in each Voronoi polygon (tower area) was calculated. The DHS divides its asset index into five quintiles and the lowest two quintiles, “poorer and poorest,” are considered to be poor (Rutstein and Johnson, 2004; Rutstein and Rojas, 2006). Following this categorization, we computed the total number of people predicted as poorer and poorest for each model (six in total) and each tower area. This allowed us to compare differences in the distribution of poverty incidence as estimated by the full and generalized models both spatially and in total numbers of people calculated to be poor.
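As a rough illustration of the head-count step, the sketch below sums the population of every tower area whose predicted wealth index falls at or below a cut-off separating the two lowest quintiles from the rest. The cut-off value, the predicted wealth values, and the populations are hypothetical; how the quintile boundary was derived in the study is not reproduced here, so it is passed in as a parameter.

```python
def people_in_poverty(cells, poor_threshold):
    """Sum the population of tower areas classed as poor.

    cells:          iterable of (predicted_wealth_index, population) pairs,
                    one per Voronoi polygon
    poor_threshold: wealth index value at or below which an area is treated
                    as "poorer or poorest" (assumed to be supplied by the
                    analyst, e.g. from the survey's quintile boundaries)
    """
    return sum(pop for wealth, pop in cells if wealth <= poor_threshold)

# Hypothetical toy data for four tower areas.
cells = [(-1.2, 1000), (-0.3, 2500), (0.4, 4000), (1.5, 500)]
poor_people = people_in_poverty(cells, poor_threshold=-0.2)
```

Running the same function on the outputs of a full and a generalized model would give the kind of paired totals compared in the results.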

The quality of the WorldPop population data is a function of the spatial resolution of the administrative units rather than the population densities themselves. Smaller disparities between the scale of the source (administrative units) and target (100-m pixels) result in better population estimates. For the three countries modeled here, the average out-of-bag prediction error (mean squared residuals over 500 trees) for the population data was 0.47, 0.56, and 0.39 for Bangladesh, Namibia, and Nepal, respectively. The average pseudo-r² for the models was 0.66, 0.95, and 0.80 for Bangladesh, Namibia, and Nepal, respectively.

Statistical analyses

Statistical analyses were implemented using the R statistical software package (R Core Team, 2015). All analytical steps were undertaken on a per-country basis, initially using all mobile phone data available within each country. Prior to model building and prediction, all CDR data were log transformed for normality. To assess multicollinearity in the data, we computed bivariate Pearson correlations (see Fig. 2) to identify pairs of covariates with r > 0.70. In choosing data for the modeling process, we compared results from the Pearson correlations to computations of the variance inflation factor (VIF), iteratively removing the covariate with the highest VIF until all remaining covariates had VIF < 4. In Bangladesh only, the number of covariates available in the dataset necessitated additional processing to reduce the number of covariates from approximately 150 to 14 non-collinear variables (see Steele et al., 2017, Section 2.4 for details).
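The iterative VIF screening can be sketched in a few lines. The analysis itself was carried out in R; the Python below is only an illustration and uses the standard identity that the VIF of covariate j equals the j-th diagonal element of the inverse of the covariates' correlation matrix. The covariate names and values are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(k)] for i in range(k)]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def vif_filter(data, threshold=4.0):
    """Drop the covariate with the highest VIF until all VIFs < threshold."""
    data = dict(data)
    while len(data) > 1:
        names = list(data)
        inverse = invert(correlation_matrix([data[n] for n in names]))
        vifs = {n: inverse[i][i] for i, n in enumerate(names)}
        worst = max(vifs, key=vifs.get)
        if vifs[worst] < threshold:
            break
        del data[worst]
    return list(data)

# Hypothetical covariates: x2 is almost a copy of x1, x3 is unrelated.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.1, 2.1, 2.9, 3.9, 5.1, 6.1, 6.9, 7.9],
    "x3": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0],
}
kept = vif_filter(data)
```

In this toy example one of the two near-duplicate covariates is removed and the unrelated one is retained.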

Fig. 2: Bivariate Pearson correlation plots for Namibia (A) and Nepal (B). These plots show all non-collinear CDR data from the input data detailed in Table 2. The variables shown here specify the model inputs for the full models.

Figure 3 diagrams the data and modeling steps undertaken for this study. Models were built using a randomly selected 70% of the data to guard against overfitting. We employed hierarchical Bayesian areal models to build relationships between poverty and CDR data at sampled locations, and to predict poverty at unsampled locations across each country. These models were chosen for their advantages in modeling geolocated household survey data: the framework allows for straightforwardly imputing missing data, specifying prior distributions on model parameters and spatial covariance, and estimating uncertainty in predictions with a full posterior distribution for each estimate (Blangiardo et al., 2013; Blangiardo and Cameletti, 2015). All models were implemented using integrated nested Laplace approximations (INLA) (Rue et al., 2009), which approximate the posterior and so avoid the computational demands and convergence issues that can be problematic for MCMC algorithms (Rue and Martino, 2007).

Fig. 3: Data inputs and methodological steps used in this study. This diagram illustrates the processes undertaken to produce national-level poverty maps for each country from raw CDR data.

Following previous work, the areal models are fit using R-INLA, with the Besag model for spatial effects specified inside the function (Blangiardo and Cameletti, 2015; Rue et al., 2009; Rue and Martino, 2007; Steele et al., 2017; The R-INLA project, 2016, 2015). Within the Besag model, gamma hyperpriors on the precision parameters τϕ and τθ are chosen to give a prior that places equal emphasis on both spatial and non-spatial variance, where the precision of ϕ, τϕ, is given the hyperprior gamma (1, 1) and the precision of θ, τθ, is given the hyperprior gamma (3.27, 1.81) (Elbers et al., 2002a). The model accounts for spatial covariance in the data by incorporating a spatially varying random effect, which is formed by the Voronoi polygons themselves as all of the data are aligned to mobile tower locations. The Voronoi polygons are clustered across each country at varying spatial scales, and neighbors are defined within a scaled precision matrix (Sørbye and Rue, 2014) built using the geographical adjacency of the mobile phone towers to explicitly incorporate the neighborhood structure of the data. This allows observations to have decreasing effects on predictions that are further away (Besag and Kooperberg, 1995). In the Besag model, Gaussian Markov random fields (GMRFs) are used to model spatial dependency structures and unobserved effects. GMRFs penalize local deviation from a constant level based on the precision parameter τ, where the hyperpriors are loggamma distributed (Sørbye and Rue, 2014). The hyperprior distribution governs the smoothness of the field used to estimate spatial autocorrelation (Sørbye and Rue, 2014). The spatial random vector x = (x1, …, xn) is thus defined as

$$x_i \mid x_j, i \ne j, \tau \sim \mathcal{N}\left( \frac{1}{n_i} \sum_{i \sim j} x_j, \; \frac{1}{n_i \tau} \right),$$

where n_i is the number of neighbors of node i, and i ∼ j indicates that the two nodes i and j are neighbors.
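The conditional distribution above says that each node's value is centred on the average of its neighbours, with a variance that shrinks as the number of neighbours or the precision grows. A minimal numerical sketch, using a hypothetical four-node adjacency structure and an arbitrary precision value rather than anything fitted in the study, is:

```python
def besag_conditional(i, x, neighbours, tau):
    """Conditional mean and variance of x[i] given its neighbours.

    Implements N(mean of neighbouring values, 1 / (n_i * tau)), where n_i
    is the number of neighbours of node i and tau is the precision.
    """
    nbrs = neighbours[i]
    n_i = len(nbrs)
    mean = sum(x[j] for j in nbrs) / n_i
    variance = 1.0 / (n_i * tau)
    return mean, variance

# Hypothetical spatial effects on four adjacent tower cells.
x = {0: 0.2, 1: -0.4, 2: 1.0, 3: 0.6}
neighbours = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
mean_1, var_1 = besag_conditional(1, x, neighbours, tau=2.0)
```

Here node 1 has three neighbours, so its conditional mean is the average of their three values and its conditional variance is 1 / (3 × 2).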

Using the fitted models, we produced estimates of the wealth index per Voronoi polygon as a posterior distribution with complete modeled uncertainty around estimates. The posterior mean and standard deviation for each polygon were then used to generate prediction maps (Fig. 4) with associated uncertainty (Fig. S4). Predictive performance of models was assessed using out-of-sample validation statistics calculated on a random 30% test subset of data; the root-mean-square error (RMSE) and the coefficient of determination (r²) were calculated for all models (Table 3). We also generated scatter plots of observed versus predicted values for visualization purposes (Fig. 4). The same modeling framework, with the same likelihoods, priors, and random spatial effect for each country, was used for generalized models including only the common set of five CDR-derived features. We produced national estimates of poverty with associated uncertainty for each country as described above and again assessed model performance using out-of-sample validation statistics on a random 30% test set of data for comparison (Table 3).
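The validation statistics can be sketched as follows. The text does not state which formula for r² was used; the version below, 1 − SS_res / SS_tot, is an assumption (the squared Pearson correlation between observed and predicted values is another common choice). The split helper, seed, and toy values are likewise hypothetical.

```python
import math
import random

def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical polygon ids split 70/30, and toy observed/predicted values.
train_ids, test_ids = train_test_split(list(range(10)))
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

Both statistics are computed only on held-out polygons, so they describe predictive rather than in-sample performance.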

Table 3 Cross-validation statistics based on a random 30% test set of data for models using all CDR features (Full model) and a common set of CDR features (Generalized model).

Fig. 4: National-level poverty estimates for each country. These maps illustrate the wealth index predictions for each Voronoi polygon, with associated out-of-sample validation statistics (the scatterplot below each map corresponds to that map and shows predicted (y-axis) vs. observed (x-axis) values) for Namibia [n = 141] (A), Nepal [n = 85] (B), and Bangladesh [n = 117] (C).

Results

Poverty mapping

We produced national-scale poverty estimates using hierarchical Bayesian spatial models, with socioeconomic data from the Demographic and Health Surveys (DHS) and independent variables derived from CDR metadata. The spatial scale of analysis was based on approximating the mobile tower coverage areas using Voronoi tessellation (Okabe et al., 2009) and all data were aligned to these Voronoi polygons. CDR data are aggregated at the tower level and the resultant values apply to the entire spatial extent of each Voronoi. We aligned the socioeconomic data from the DHS to the Voronois by matching the lat/long of each household cluster to the polygon in which its centroid fell. We used the DHS wealth index (Rutstein, 2008; Rutstein and Johnson, 2004), an asset-based indicator of poverty calculated from nationally representative household survey data. We modeled the mean wealth index score of sampled populations within each Voronoi polygon, and where multiple household clusters fell within the same Voronoi, we modeled the mean aggregate value.

CDR-derived covariates varied for each country based on availability (see Table 2). Broadly, we utilized measures of user mobility, including the number of unique towers visited, entropy of places, and users’ radius of gyration—an indicator of movement trajectories (González et al., 2008); basic phone usage, such as the percentage of nocturnal calls made and outgoing/incoming counts of texts and calls; and social network features, including the number of interactions per contact and the entropy of users’ contacts. These social network features have been shown to correlate with economic well-being (Eagle et al., 2010). In Bangladesh only, we were able to access and use revenue and consumption data based on users’ recharge amounts and frequencies. All CDR-derived covariate data were aligned with wealth index data in each Voronoi and fit as areal models using integrated nested Laplace approximations (INLA) (Rue et al., 2009) to estimate poverty per tower area with associated uncertainty (Fig. 4A–C and Supplementary Information, Fig. S4A–C).
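Three of the mobility covariates named above can be written down compactly. The definitions below follow common usage (Shannon entropy over visited towers; radius of gyration as the root-mean-square distance of recorded positions from their centre of mass) and are assumptions for illustration, since the exact implementations used in the study are not given in this section. Planar coordinates and the toy visit log are hypothetical.

```python
import math
from collections import Counter

def unique_towers(visits):
    """Number of distinct towers a user was observed at."""
    return len({tower for tower, _ in visits})

def entropy_of_places(visits):
    """Shannon entropy (natural log) of the distribution of visited towers."""
    counts = Counter(tower for tower, _ in visits)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def radius_of_gyration(visits):
    """Root-mean-square distance of recorded positions from their centroid."""
    n = len(visits)
    cx = sum(x for _, (x, _y) in visits) / n
    cy = sum(y for _, (_x, y) in visits) / n
    return math.sqrt(
        sum((x - cx) ** 2 + (y - cy) ** 2 for _, (x, y) in visits) / n
    )

# Hypothetical user observed twice at each of two towers 4 units apart.
visits = [("A", (0.0, 0.0)), ("A", (0.0, 0.0)),
          ("B", (4.0, 0.0)), ("B", (4.0, 0.0))]
n_towers = unique_towers(visits)
entropy = entropy_of_places(visits)
r_gyr = radius_of_gyration(visits)
```

A user who splits time evenly between two towers has an entropy of ln 2, and a radius of gyration equal to half the distance between the towers.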

All CDR data comprising the full models for each country are presented in Table 4; we used all non-collinear CDR data for each country from Table 2 in these models. We then assembled the CDR data for the generalized models by selecting CDR features that were statistically significant in at least one country’s full model and were also available in all three countries. Table 5 shows these results: statistically significant covariates from the full models are listed, with the data from the generalized models highlighted in bold italics. The variables for the generalized models are the same for each country and include: number of unique towers visited, outgoing call count, percent nocturnal communications, radius of gyration, and entropy of places.

Table 4 Full model specifications for each country.
Table 5 Mobile phone data used in country-specific and generalized poverty models.

We find models utilizing only the common set of CDR features perform nearly identically to the full suite of predictors in Namibia and Nepal, and comparatively less well in Bangladesh (Table 3). The differences in predictive performance were modest: Namibia full model r² = 0.66, generalized model r² = 0.65; Nepal full model r² = 0.61, generalized model r² = 0.60; Bangladesh full model r² = 0.64, generalized model r² = 0.50. Only in Bangladesh was there a notable increase in model error associated with reduced data inputs: Namibia full model RMSE = 0.48, generalized model RMSE = 0.48; Nepal full model RMSE = 0.53, generalized model RMSE = 0.54; Bangladesh full model RMSE = 0.48, generalized model RMSE = 0.57.

The number of unique towers visited and percent nocturnal calls had the strongest effect on poverty predictions in the models built using the common CDR dataset (see Supplementary Information, Tables S1–S3). In addition, outgoing call counts were important in Namibia, whereas radius of gyration and entropy of places were prominent in Nepal. Given the full suite of available predictors, the number of unique towers visited and percent nocturnal calls remained significant. Results also show a few key covariates unique to each country, as these models included data that were not available for all three countries at the time of this study. In Namibia, these were outgoing text counts and the number of users whose home location is at each tower. In Nepal, entropy of contacts and the percentage of interactions from users’ home tower were significant. In Bangladesh, incoming text counts were important covariates, as well as measures of top-ups, multimedia messaging, and Internet usage.

Spatial distribution of poverty

The DHS divides its asset index into five quintiles and the lowest two quintiles are considered poor (Rutstein and Johnson, 2004; Rutstein and Rojas, 2006). To explore the spatial distributions of poverty, we calculated the total number of people in the lowest two quintiles for each model. The mean wealth index score modeled in Namibia and Nepal is bimodal in distribution, with a higher proportion of households falling into lower quintiles (see Supplementary Information, Fig. S1A, B). In Bangladesh, the mean wealth index score modeled is positively skewed, with far greater numbers of households in lower quintiles (Fig. S1C). We would expect a more accurate model to reflect these input data and predict greater numbers of people in poverty nationally in Bangladesh, and better differentiation of poverty and wealth in Namibia and Nepal (more poverty, more wealth, and fewer middle-class geographies). Models using all CDR-derived features produced marginally better outputs in terms of prediction and error, while also including additional data specific to each country; thus, we expect greater numbers of people in poverty to be predicted by the full models. Models using only the common subset of CDR data could leave out poor people and fail to capture differences across geographies due to incompleteness of data, higher model error, or lower predictive power.

In Namibia, the full model predicts 909,432 people in poverty versus 857,761 predicted by the generalized model. There are small shifts in the spatial distribution of poverty where areas are predicted to be poorer or richer, but in general the patterns in the urban centers and north/south regional trends hold (see Supplementary Information, Fig. S2). In Bangladesh, the differences are striking, both with respect to the total numbers of people in poverty (full and generalized models predict 17,107,057 and 9,832,711 poor people, respectively) and in their spatial distribution. The additional CDR data used in the full model (text counts, top-up data, multimedia messaging data, and Internet usage) produce a map with greater precision and distinction between poverty and wealth (Fig. S3) as compared to the generalized model, which predicted most areas to be in the middle class.

Nepal demonstrated a different outcome, where the generalized model predicted greater numbers of people in poverty (generalized: 6,707,748 and full: 6,436,490). Likewise, the spatial distribution of the predictions shifted appreciably between the models (Fig. 5). To explore this, we looked at the covariates that were having the greatest effect on model outputs along with existing benchmarks of population density and socioeconomic status. Entropy of places, radius of gyration, and number of unique towers visited have the greatest effect on outputs from the generalized model (see Table S2), which are all measures of user mobility. Also, higher wealth is predicted in the national parks, where an increase in mobility from tourism could be a contributing factor (Nepal Ministry of Culture, Tourism & Civil Aviation, 2014). In terms of absolute change, more and greater levels of poverty are predicted across the southern regions of Nepal in the generalized model.

Fig. 5: National wealth index maps produced for Nepal.

This figure shows maps produced using all available noncollinear CDR data (A), a CDR subset comprising five generalizable features (B), and the difference between these two models (C). For (A) and (B), the subset maps show, in black, poor areas predicted by each model (the Demographic and Health Surveys class the two lowest quintiles, poorer and poorest, as poor).

By incorporating recent population (Worldpop Research Group, 2017) and poverty data for Nepal (Bank, 2013; Haslett et al., 2014; Nepal and Bohara, 2010), it became evident that although poverty incidence in rural Nepal (predominantly in the north and northwest) is higher than in urban Nepal, the numbers of absolute poor are higher across the southern regions, especially in the south-southeast, due to higher population densities. Of our two models, the generalized model reflects this more accurately. Driven by mobility data, it predicts greater levels of poverty in regions of high population density that concurrently have lower mobility. This better reflects the higher total numbers of people in poverty and indicates that poorer people have lower mobility than wealthier people, consistent with findings elsewhere (Wesolowski et al., 2013).
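The distinction between poverty incidence and the absolute number of poor people can be made concrete with a small calculation. The two regions below are entirely hypothetical (they are not estimates for Nepal or values from this study); the sketch only shows that a densely populated region with lower incidence can still contain more poor people.

```python
def absolute_poor(population, incidence):
    """Number of poor people = population x share of the population that is poor."""
    return population * incidence

# Hypothetical regions, NOT values from the study.
sparse_high_incidence = absolute_poor(population=200_000, incidence=0.40)
dense_low_incidence = absolute_poor(population=1_000_000, incidence=0.20)

# The dense region has half the incidence but more poor people in total.
more_poor_in_dense_region = dense_low_incidence > sparse_high_incidence
```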

Discussion

The results here demonstrate that five easily replicable, population-level CDR-derived features are able to account for 50–65% of the variance in socioeconomic status nationally across Namibia, Nepal, and Bangladesh, highlighting how a smaller set of data is able to contribute to monitoring and mapping poverty metrics across countries. This work represents the first attempt to generalize CDR-derived features across countries to predict poverty. We are able to identify aggregate information reflecting users’ mobility and call behavior as having a key role in explaining the distribution of poverty in very different contexts. The results provide evidence-based support for including aggregated, anonymized CDRs wherever possible as a non-trivial data component for strategic poverty measurement and monitoring, and demonstrate that CDRs do give reasonable estimates of the distributions of socioeconomic status across LMICs.
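For readers less familiar with the metric, the share of variance explained is conventionally computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a generic illustration with hypothetical values; it is not the authors' evaluation code and makes no assumption about how their reported figures were validated.

```python
def r_squared(observed, predicted):
    """Share of variance in `observed` accounted for by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical wealth index scores for four survey clusters.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
score = r_squared(observed, predicted)
```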

Although our aim is not to determine causation, or the determinants of poverty, we thought that data related to user mobility and call patterns would correlate well with socioeconomic status and considered the following explanations:

  1. We expect higher levels of mobility (as measured by the radius of gyration and entropy of places) to be associated with a higher level of socioeconomic status (Wesolowski et al., 2013), the idea being that wealthier people are more mobile and visit more places than poorer people.

  2. We expect a higher percentage of nocturnal calls to correlate with a lower level of socioeconomic status, the idea being that nighttime rates are cheaper, so poorer people will do more of their communicating during these “off peak” times.

  3. We expect a higher count of outgoing calls to be associated with a higher socioeconomic status, the idea being that the initiating party pays for outgoing calls, whereas receiving a call does not result in a charge.
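The features named in the list above can be sketched for a single user as follows. This is a minimal illustration under stated assumptions, not the authors' feature pipeline: tower positions are treated as planar coordinates in kilometres, the night window of 19:00 to 07:00 is an arbitrary choice, and the function names and toy data are hypothetical. The study works with aggregated, population-level features, so per-user values like these would still need to be aggregated.

```python
import math
from collections import Counter

def radius_of_gyration(visits):
    """Root-mean-square distance of a user's events from their centre of mass.

    `visits` is a list of (x_km, y_km) tower positions, one entry per event.
    """
    n = len(visits)
    cx = sum(x for x, _ in visits) / n
    cy = sum(y for _, y in visits) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in visits) / n)

def entropy_of_places(visits):
    """Shannon entropy (natural log) of the share of events at each tower."""
    n = len(visits)
    return -sum((c / n) * math.log(c / n) for c in Counter(visits).values())

def percent_nocturnal(call_hours, night_start=19, night_end=7):
    """Percentage of calls placed at night; the 19:00-07:00 window is assumed."""
    night = sum(1 for h in call_hours if h >= night_start or h < night_end)
    return 100.0 * night / len(call_hours)

def outgoing_count(directions):
    """Number of events the user initiated ('out') rather than received ('in')."""
    return sum(1 for d in directions if d == "out")

# Toy user: four events split evenly between two towers 5 km apart.
visits = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
rg = radius_of_gyration(visits)
h = entropy_of_places(visits)
night_share = percent_nocturnal([2, 10, 14, 22])
n_out = outgoing_count(["out", "in", "out", "out"])
```

A user who only ever connects to one tower has a radius of gyration and an entropy of zero, which is the low-mobility end of the scale described in the first explanation.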

Mobility features were most important in explaining the variation in poverty across Nepal, whereas in Namibia call pattern data were more significant. Both types of data were needed in Bangladesh to achieve 50% explained variance in the generalized model. Text counts (incoming and outgoing) were unobtainable for Nepal at the time of this study, but were important features in mapping poverty in Namibia and Bangladesh. This is not surprising, as texting is customary in these countries and people may text each other more than they call. In Nepal, event durations were more important than event counts: the length of incoming and outgoing calls was a better predictor of socioeconomic status in this context, suggesting that measures of event duration capture important information on consumption and expenditures. Additional data, especially top-ups, Internet usage, and SMS communications, are expected to improve poverty maps wherever these data are available. Further exploration is needed of the relationship between distance-based CDR features and tower locations to quantify the extent to which some of the mobility and entropy covariates provide unique information versus being a function of tower distribution. The final sets of covariates here were determined largely by availability, MNO agreements, and relevance in previous work.

Poverty maps produced with CDR-derived features need to be interpreted within a particular locality. Poverty is highly context-specific and factors associated with poverty can vary considerably from country to country. Understanding key country-specific data together with information on how people use their phones is essential. In Bangladesh, direct measures of consumption (top-ups, data on Internet usage, and multimedia messaging features) were necessary to capture the variability in poverty and detect the poorest households present in the household survey data. As demonstrated in the Nepal models, the spatial distribution and estimates of numbers of absolute poor can shift significantly based on different types of input data. Incorporating additional information on geographical conditions and phone usage yields reciprocal benefits. For example, population densities and demographic data provide insight into the mobility patterns of low-income people, and this information highlights how well model outputs are estimating fine-scale variation in poverty. This could reduce the inadvertent exclusion of poor people from estimates and, ultimately, from programs designed to reduce poverty.

In the absence of a census, the method applied here is able to estimate poverty reasonably well as measured by the DHS wealth index. We chose the wealth index for this study as it is widely available. The DHS produces estimates approximately every 5 years for many LMICs (The DHS Program Country List, 2018), depending on survey type, instrument, and sample size (The Demographic and Health Survey (DHS), 2018). As such, it could be feasible to use these data to construct high-resolution estimates of asset-based poverty 2–3 times before 2030. With additional household survey data, using similar methodologies and variables to construct a wealth index, more points in time could be produced. Therefore, it must be noted that the method applied here captures long-term poverty trends (i.e., 5-year changes in assets and living standards) rather than short-term developments (i.e., 6–12-month changes in consumption or expenditure). Ideally, we would also test income- and consumption-based metrics of poverty to better understand how well the applied method and data could capture these short-term changes in socioeconomic status, but those data were not available at the time of this study. To that end, it would be useful to incorporate other types of survey data to test how well CDRs, or other types of ‘big data’ for that matter, can estimate short-term changes in poverty and wealth during intercensal periods, and to integrate these estimates temporally. Evaluating the extent to which features derived from CDRs can capture these short-term fluctuations would be requisite for a proper evaluation of their usefulness as compared to traditional surveys.

We inevitably had a mismatch in years of CDR and survey data for Nepal and Bangladesh. Where both datasets were concurrent in Namibia, we achieved the best results—highest predictive power and lowest error, with no appreciable difference between the full and generalized models, highlighting the importance of matching data sources temporally. As demonstrated in previous work (Chen and Nordhaus, 2011; Elbers et al., 2002a; Head et al., 2017; Jean et al., 2016; Njuguna and McSharry, 2017; Noor et al., 2008; Steele et al., 2017; Watmough et al., 2016), data from satellites and user-generated GIS platforms are important data sources expected to improve predictions, especially in rural areas where mobile towers can be sparse. Fewer model features provide computational tractability of analysis and interpretability for policy makers or non-specialists. Nevertheless, as computing power and algorithm development progress, we will be able to extract these types of measures faster and with increased accuracy. This will make understanding how people use technology, and their geographical conditions, even more important when making inferences from big data to derive solutions to improve lives.

Progressing the global development agendas requires identification of the poor, and CDRs can contribute to these efforts by providing timely, accurate updates on socioeconomic status in populations for monitoring and evaluation. These data also offer the potential of dynamic measurement and the ability to evaluate change over time. Although significant challenges in accessing these data and distributing outputs remain, we are optimistic that studies such as this demonstrating the usefulness of aggregate, anonymous CDR data will encourage mobile operators to continue to collaborate with researchers, development agencies, and governments working toward development goals. Part of this process necessarily includes getting data on the political agenda to connect the supply and demand of data, and create enabling environments for data to flow across systems and users. This could yield increased efficiency and foster the incorporation of real-time data into how the SDGs are being addressed.