Global poverty estimation using private and public sector big data sources

Household surveys give a precise estimate of poverty; however, surveys are costly and are fielded infrequently. We demonstrate the importance of jointly using multiple public and private sector data sources to estimate levels and changes in wealth for a large set of countries. We train models using 63,854 survey cluster locations across 59 countries, relying on data from satellites, Facebook Marketing information, and OpenStreetMaps. The model generalizes previous approaches to a wide set of countries. On average, across countries, the model explains 55% (min = 14%; max = 85%) of the variation in levels of wealth at the survey cluster level and 59% (min = 0%; max = 93%) of the variation at the district level, and the model explains 4% (min = 0%; max = 17%) and 6% (min = 0%; max = 26%) of the variation of changes in wealth at the cluster and district levels. Models perform best in lower-income countries and in countries with higher variance in wealth. Features from nighttime lights, OpenStreetMaps, and land cover data are most important in explaining levels of wealth, and features from nighttime lights are most important in explaining changes in wealth.

Table S1 shows the 59 countries used for estimating levels of wealth and the survey year used, the number of survey clusters within each country, and the average and standard deviation of the global wealth index. Table S2 shows the same, but for the 33 countries used for estimating changes in wealth.

S2 Comparing DHS Wealth Index and Global Wealth Index
DHS provides an asset-based wealth index; however, the wealth index is not comparable across countries.
We create a globally comparable wealth index by taking the first principal component of a set of asset variables across the entire dataset, pooled across countries and across time. Figure S1 shows the association between the original DHS wealth index and the global wealth index we create for each country. The indices are strongly associated across countries. In 54 of 59 (92%) countries the r² exceeds 0.9; the minimum r² between the two indices is 0.36, in Albania.

For example, most Facebook variables in Liberia have a correlation of 0.7 and above. Our data show, though, that Liberia has low Facebook penetration: 14% of the population was active in the month when we queried data, compared to the median of 33% across the 59 countries. In a country with lower Facebook penetration, the variables may be highly correlated with wealth simply because they indicate the presence of any active Facebook users in the location, where the presence of Facebook users may itself be indicative of higher wealth. Figure S2 also illustrates which Facebook variables tend to see higher correlations with wealth across countries. In particular, the "interest" variables tend to see higher correlations with wealth.
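The global wealth index described earlier in this section (the first principal component of asset variables pooled across countries and time) can be illustrated with a minimal sketch. The function name, the standardization step, the sign convention, and the toy asset data are assumptions made for illustration; the text does not specify these details. The sketch uses power iteration on the pooled correlation matrix and assumes no asset column is constant.

```python
import math

def first_principal_component(rows, iterations=500):
    """Return (loadings, scores) for the first principal component.

    `rows` holds one list of numeric asset indicators per household,
    pooled over all countries and survey years (hypothetical layout).
    """
    n, p = len(rows), len(rows[0])
    # Standardize each asset column over the pooled sample.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    # Correlation matrix of the standardized columns.
    corr = [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]
    # Power iteration converges to the dominant eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Assumed sign convention: owning more assets means a higher index.
    if sum(v) < 0:
        v = [-x for x in v]
    scores = [sum(zr[j] * v[j] for j in range(p)) for zr in z]
    return v, scores

# Toy pooled sample: columns are hypothetical binary asset indicators.
households = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
loadings, index = first_principal_component(households)
```

Because the component is estimated once on the pooled sample, the same loadings apply to every household, which is what makes the resulting scores comparable across countries and years.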

S4 Within-Country Correlations of Top Features
To understand the extent to which different variables across datasets capture the same dynamics, figure S3 shows the correlation between select features (panel A) and the standard deviation of correlations across countries (panel B). We use the feature with the highest correlation with the DHS wealth score for each dataset. Many of the features most correlated with wealth also see high correlations with each other; nighttime lights, length of residential roads, and urban land cover all have more than a 0.7 correlation with each other. These features also see low standard deviations in correlations across countries, indicating the correlation between these features is consistently strong across countries.

Figure S4 shows the scatterplot of true and estimated levels of wealth for each country at the survey cluster level and figure S5 shows the same when aggregating data to the district level. The scatterplots illustrate the strong association between true and estimated wealth in many countries.

For each DHS survey cluster, we compute the error between model prediction and true wealth. Here, we test whether the magnitude of this error (taking the absolute value of the difference between true and predicted wealth) is correlated with factors including nighttime lights, whether the cluster is classified as urban or rural, the geographic region, and country income level. We examine both errors for estimating levels of wealth and changes in wealth.
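The error measure described above (the absolute difference between true and predicted wealth) and a comparison of its average across one categorical factor can be sketched as follows. The function names and the example values are hypothetical and are not taken from the paper's data; the sketch covers only a single-factor summary, not a multivariate regression.

```python
from collections import defaultdict

def absolute_errors(true_wealth, predicted_wealth):
    """Absolute value of true minus predicted wealth, per cluster."""
    return [abs(t - p) for t, p in zip(true_wealth, predicted_wealth)]

def mean_error_by_group(errors, groups):
    """Average absolute error within each level of one factor."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for e, g in zip(errors, groups):
        totals[g][0] += e
        totals[g][1] += 1
    return {g: s / c for g, (s, c) in totals.items()}

# Hypothetical clusters: true and predicted wealth, urban/rural label.
true_wealth = [1.0, 2.0, 3.0, 4.0]
predicted_wealth = [1.5, 2.0, 2.0, 4.5]
setting = ["rural", "rural", "urban", "urban"]

errors = absolute_errors(true_wealth, predicted_wealth)
by_setting = mean_error_by_group(errors, setting)
```

The same grouping can be repeated for region or income level; a continuous factor such as nighttime lights would need binning or a regression instead.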
Tables S3 and S4 show results from regressing the error on the select factors. Figures S10 and S11 show results when examining each factor independently. For both levels and changes, error estimates tend to be slightly lower in rural locations, in Africa, and in low and lower middle-income countries compared to upper middle-income countries. For levels of wealth, when nighttime lights are low, there is a large variation in error; at the highest levels of nighttime lights, error estimates are low.

Table S5 compares the standard deviation of wealth within districts and across districts for each country, using the wealth asset index. Out of the 59 countries, 35 (59%) have a larger standard deviation across districts compared to within districts.

In this paper, we leverage an asset-based measure of wealth. An asset index approach is typically used when neither income nor expenditure data are available, as is the case with DHS data. To test the sensitivity of our results to using a measure of consumption, we leverage data for six countries using the sources of the World Bank poverty and inequality measures, the Living Standards Measurement Surveys (LSMS), available at http://pip.worldbank.org/home. LSMS data provide the expenditure or consumption data used for the World Bank (monetary) poverty estimates and include assets, which we used to construct an asset index using the same methods we use with the DHS data. Unlike DHS data, the surveys are not standardized, and LSMS does not publicly release GPS coordinates of survey clusters for all surveys.
For this analysis, we focus on sub-Saharan Africa, where the method is more likely to be used, and on countries for which LSMS data are available around 2016-2019. We end up with the following sample of countries: Burkina Faso, Benin, Cote d'Ivoire, Ethiopia, Malawi, and Togo. Figure S14 shows the association between consumption and the wealth index for each country. The wealth index is positively and significantly associated with consumption and explains 40-66% of the variation in consumption depending on the country.
We retrain the machine learning model using data from LSMS, separately training the model to estimate the wealth index and consumption. We leverage the XGBoost algorithm to train models. For each country, we divide the country into five folds, where each fold is geographically separated, and use four folds to train a model to estimate asset wealth or consumption in the left-out fold.
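The fold structure described above can be sketched as follows. The text does not say how folds are geographically separated, so splitting clusters into west-to-east longitude bands is purely an assumption for illustration, as are the function names and coordinates. The model fit itself (XGBoost in the text) is left out; the sketch only produces the train and test indices.

```python
def geographic_folds(longitudes, k=5):
    """Assign each cluster to one of k folds by longitude rank.

    Each fold is a contiguous west-to-east band (assumed scheme),
    so held-out clusters are spatially separated from most of the
    training clusters.
    """
    n = len(longitudes)
    order = sorted(range(n), key=lambda i: longitudes[i])
    folds = [0] * n
    for rank, i in enumerate(order):
        folds[i] = rank * k // n
    return folds

def leave_one_fold_out(folds):
    """Yield (held_out_fold, train_indices, test_indices)."""
    for held_out in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != held_out]
        test = [i for i, f in enumerate(folds) if f == held_out]
        yield held_out, train, test

# Hypothetical cluster longitudes for one country.
longitudes = [3.1, -1.2, 0.5, 2.2, 1.0, -0.4, 2.9, 0.1, 1.7, -2.0]
folds = geographic_folds(longitudes, k=5)
splits = list(leave_one_fold_out(folds))
```

In each split, a model would be trained on the four training folds and used to predict asset wealth or consumption for the held-out fold, so every cluster receives exactly one out-of-sample estimate.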
Figure S15 shows that the model better estimates the asset-based wealth index compared to consumption in all countries. Despite differences in the wealth index and consumption, similar sets of features are most important in estimating both wealth indicators. Figure S16 shows the model performance when select sets of features are used to train the model. Models trained on nighttime lights, daytime and nighttime lights, and OpenStreetMaps perform best for both wealth indicators, while models trained on weather/climate features, SAR data, and Facebook features perform worse.

S15 Comparing DHS and Facebook education variables
The paper tests the ability of variables from Facebook Marketing data to estimate wealth. In this section, we test the ability of Facebook data to capture a similar variable from the Demographic and Health Surveys (DHS). We use a variable that is captured in both data sources: the proportion of those with higher than a high school education. From DHS, we use the proportion of household members in each cluster who have a higher than secondary education; from Facebook, we use the proportion of monthly active users who report having higher than a high school education. We estimate the correlation between the two variables both at the cluster and district levels, and restrict the analysis to countries with 30 or more districts.
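The cluster-level and district-level comparison described above can be sketched as follows. The function names and the example proportions are hypothetical, and aggregating to districts with a simple unweighted mean of cluster values is an assumption; the text does not state how cluster values are aggregated.

```python
import math
from collections import defaultdict

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def district_means(values, districts):
    """Unweighted mean of cluster values per district (assumed rule)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, d in zip(values, districts):
        sums[d] += v
        counts[d] += 1
    return {d: sums[d] / counts[d] for d in sums}

def education_correlations(dhs, facebook, districts):
    """Return (cluster-level r, district-level r) for one country."""
    cluster_r = pearson(dhs, facebook)
    dhs_d = district_means(dhs, districts)
    fb_d = district_means(facebook, districts)
    keys = sorted(dhs_d)
    district_r = pearson([dhs_d[k] for k in keys], [fb_d[k] for k in keys])
    return cluster_r, district_r

# Hypothetical proportions with above high school education.
dhs = [0.10, 0.20, 0.05, 0.30, 0.15, 0.25]
facebook = [0.30, 0.35, 0.20, 0.60, 0.25, 0.50]
districts = ["A", "A", "B", "B", "C", "C"]
cluster_r, district_r = education_correlations(dhs, facebook, districts)
```

Running this once per country gives the two within-country correlations whose distributions are compared across countries.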
Figure S17 shows the distribution of the within-country correlation at the cluster and district level.
Correlation using both unit types has a large variation, with countries seeing both low and high correlations. However, the median correlation across countries at the cluster and district level is 0.41 and 0.53, respectively, showing that: (1) in most countries, above high school education captured by DHS and Facebook move roughly together; and (2) the correlation is larger at a higher aggregation. Figures S18 and S19 show scatterplots of the two variables across countries.
In figure S20, we attempt to explain the variation in correlation using (1) the number of units used to compute the correlation, (2) country population, and (3) the proportion of the population active on Facebook (relying on monthly active users for the month when the data from Facebook was queried).
The figure shows no notable association between the within-country correlation and the country-level variables.

S5 Scatterplots of True and Estimated Levels of Wealth for Each Country
S6 Estimating Levels of Wealth: Pooled Results by Continent
S7 Model Performance Estimating Levels of Wealth for Each Country and Feature Set
S8 Scatterplots of True and Estimated Changes in Wealth for Each Country
S9 Explaining Error Variation
S10 Application: Estimating Wealth in Different Years using First Administrative Division
S11 Comparing Results Across Machine Learning Algorithms
S12 Variation in Wealth: Within and Across Districts
S13 Comparison of Results to Other Papers
S14 Wealth asset index and consumption comparison
S15 Comparing DHS and Facebook education variables

Figure S1: Association between DHS wealth index and global wealth index for each country

Figure S2: Correlation of Facebook features with wealth index

Figure S3: Correlation of variables between each other. We use the variable with the highest correlation to the wealth score in each dataset. Panel A shows the average correlation across countries and panel B shows the standard deviation in correlations across countries. The variables are ordered top to bottom and right to left according to their correlation with the wealth score.

Figure S4: Scatterplot between true and estimated levels of wealth for each country using survey clusters as the unit of analysis. r² is the squared Pearson correlation coefficient, and R² is the coefficient of determination.
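The two statistics named in this caption can differ substantially, which a small worked example makes concrete. The sketch below uses the standard definitions (r² as the squared Pearson correlation, R² as one minus the ratio of residual to total sum of squares); the function names and values are hypothetical. Predictions that track the truth perfectly but with a constant offset give r² = 1 while R² is much lower.

```python
def squared_pearson(y_true, y_pred):
    """Squared Pearson correlation between true and predicted values."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (vt * vp)

def coefficient_of_determination(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; penalizes bias, can be negative."""
    n = len(y_true)
    mt = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values: predictions are the truth shifted up by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]

r2 = squared_pearson(y_true, y_pred)                # 1.0
big_r2 = coefficient_of_determination(y_true, y_pred)  # about 0.2
```

Reporting both therefore separates how well estimates rank locations (r²) from how close they are in absolute terms (R²).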

Figure S5: Scatterplot between true and estimated levels of wealth for each country when aggregating to districts. r² is the squared Pearson correlation coefficient, and R² is the coefficient of determination.

Figure S6: Scatterplot between true and estimated levels of wealth for each country when aggregating to districts. r² is the squared Pearson correlation coefficient, and R² is the coefficient of determination.

Figure S7 shows model performance (r² between true and estimated wealth) for each country when training on each set of features. The figure illustrates variation in which feature sets work well across countries. In some countries, most feature sets either consistently work well or consistently work poorly. For example, none of the models, whichever feature set is used, works particularly well in Comoros, which suggests that something about the country or its survey data may hinder wealth estimation.

Figure S7: Model performance estimating levels of wealth for each country and feature set

Figure S8: Scatterplots of true and estimated changes in the wealth index at the cluster level. r² is the squared Pearson correlation coefficient, and R² is the coefficient of determination.

Figure S9: Scatterplots of true and estimated changes in the wealth index at the district level. r² is the squared Pearson correlation coefficient, and R² is the coefficient of determination.

Figure S14: Comparison of wealth asset index and consumption

Figure S17: Distribution of within-country correlation between the proportion of the population with above high school education as measured by DHS and Facebook. The boxplots include center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points beyond whiskers, outliers.

Figure S18: Cluster-level scatterplot between proportion with above high school education as measured by Facebook and DHS

Table S1: DHS summary statistics of countries used for estimating levels of wealth

Table S2: DHS summary statistics of countries used for estimating changes in wealth

Table S3: Explaining error (absolute value of true minus predicted wealth) based on select factors

Table S5: Comparing standard deviation in wealth within and across districts