## Introduction

Local-level measurements of human well-being are important for informing public service delivery and policy choices by governments, for targeting and evaluating livelihood programs by governmental and non-governmental organizations, and for the development and deployment of new products and services by the private sector. While recent work has generated granular estimates of a range of human and physical capital measures in parts of the developing world1,2,3,4,5, similar data on key economic indicators remain lacking, constraining even basic efforts to characterize who and where the poor are.

## Discussion

Our satellite-based deep learning approach to measuring asset wealth is both accurate and scalable, and consistent performance on held-out countries suggests that it could be used to generate wealth estimates in countries where data are unavailable. Results suggest that such estimates could be used to help target social programs in data-poor environments, as well as to understand the determinants of variation in well-being across the developing world.

However, while our CNN-based approach outperforms approaches to poverty prediction that use simpler features common in the literature (e.g. scalar nightlights7), the information the CNN is using to make a prediction is less interpretable than these simpler approaches, perhaps inhibiting adoption by the policy community. A key avenue for future research is in improving the interpretability of deep learning models in this context, and in developing approaches to navigate this apparent performance-interpretability tradeoff.

Our deep learning approach is also perhaps best viewed as a way to amplify rather than replace ground-based survey efforts, as local training data can often further improve model performance (Fig. 3b), and because other key livelihood outcomes often measured in surveys—such as how wealth is distributed within households, or between households within villages—are more difficult to observe in imagery. Similarly, our approach could also be applied to the measurement of other key outcomes, including consumption-based poverty metrics or other key livelihood indicators such as health outcomes. Performance in these related domains will depend both on the availability and quality of training data, which remains limited for key outcomes such as consumption in most geographies. Finally, our approach could likely be further improved by the incorporation of higher-resolution optical and radar imagery now becoming available at near daily frequency (Fig. 1b), or in combination with data from other passive sensors such as mobile phones17 or social media platforms24. All represent scalable opportunities to expand the accuracy and timeliness of data on key economic indicators in the developing world, and could accelerate progress towards measuring and achieving global development goals.

## Methods

### Construction of asset wealth index

The asset wealth index is constructed from responses to the set of questions about asset ownership that are common across DHS countries and waves: number of rooms occupied in a home, whether the home has electricity, the quality of house floors, water supply and toilet, and ownership of a phone, radio, TV, car and motorbike. Variables such as floor type are converted from descriptions of the asset to a 1–5 score indicating the quality of the asset. We then construct an asset index at the household level from the first principal component of these survey responses, a standard approach in development economics13,16. This index is meant to capture household asset ownership as a single dimension, rather than act as a direct measure of poverty. By construction, the index has a mean equal to 0 and standard deviation of 1 across households. Supplementary Table S4 provides derived loadings for the first principal component.
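As a minimal sketch of this construction (not the authors' code), the snippet below standardizes a household-by-asset matrix, takes the first principal component, and rescales the scores to mean 0 and standard deviation 1. The function name `asset_index`, the synthetic data, and the sign convention (more assets gives a higher index) are assumptions made for illustration.

```python
import numpy as np

def asset_index(assets: np.ndarray) -> np.ndarray:
    """First principal component of the asset columns, rescaled to mean 0, sd 1."""
    # Standardize each asset variable (column) across households.
    z = (assets - assets.mean(axis=0)) / assets.std(axis=0)
    # eigh returns eigenvalues in ascending order, so the last column of
    # eigvecs holds the loadings of the first principal component.
    _, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
    loadings = eigvecs[:, -1]
    # Assumed sign convention: owning more assets yields a higher index.
    if loadings.sum() < 0:
        loadings = -loadings
    scores = z @ loadings
    return (scores - scores.mean()) / scores.std()

# Synthetic households: four noisy indicators of one latent wealth variable.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
toy = np.column_stack([latent + rng.normal(scale=0.5, size=500) for _ in range(4)])
index = asset_index(toy)
```

On this toy data the recovered index tracks the latent variable closely, which is the sense in which a single component can summarize asset ownership.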

Survey data are derived from 43 Demographic and Health Surveys (DHS) conducted in 23 countries in Africa from 2009 to 2016 (Supplementary Table S1). In addition to the asset data, each DHS survey contains latitude/longitude coordinates for each survey enumeration area (or cluster) surveyed, each roughly equivalent to a village in rural areas and a neighborhood in urban areas. We removed clusters with invalid GPS coordinates and clusters for which we were unable to obtain satellite imagery, leaving us with 19,669 clusters. To protect the privacy of the surveyed households, DHS randomly displaces the GPS coordinates by up to 2 km for urban clusters and 10 km for rural clusters25; this introduces a source of noise in our training data.

### Validating the wealth index

The PCA-based index is quite robust to methods of calculation as well as variables included in the index. We compare our cross-country pooled PCA index to a measure that is the sum of all the assets owned, a PCA constructed from only objects that are owned (e.g. TV, radio) and not from housing quality scores, which are more subjective, and country-specific asset indices created from running the PCA on each country separately. As shown in Supplementary Fig. S2, correlations between the pooled PCA index we use and these alternative variants range from r² = 0.80 to r² = 0.98.

### Replicating the wealth index in other contexts

We then create similar asset indices using two separate external data sets: census data from countries whose censuses report asset ownership questions, and data from the Living Standards Measurement Study (LSMS) conducted by the World Bank. In the publicly available census data, a 10 percent sample of microdata geolocated to the second administrative level (roughly, district or county) is available from each country. We focus on countries with public data that conducted censuses within 4 years of a DHS survey in our main sample and that had gathered data on assets similar to what was available in DHS. We found that 8 countries (Benin, Lesotho, Malawi, Rwanda, Sierra Leone, Senegal, Tanzania, and Zambia) had all asset variables used in DHS excluding motorbike and rooms per person. (Using DHS data, we find that the original index and an index constructed excluding these two variables had an r² = 0.99.) Our overall census sample yielded a total of 2,157,000 households observed in 656 administrative areas across these eight countries.

As census data are only georeferenced at second administrative levels, both DHS and census datasets are aggregated to the second-level administrative boundaries provided in the census data. Census data are aggregated using census household weights to construct representative district averages. A raw average across households is used to construct the corresponding DHS value; DHS and LSMS data do not provide household weights that allow construction of sub-nationally representative estimates.
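The two aggregation rules can be illustrated with a small sketch: a weighted mean per district (as for the census data) and a raw mean per district (as for DHS). The record layout `(district, wealth, weight)` and the function name `district_means` are hypothetical, chosen only for this example.

```python
from collections import defaultdict

def district_means(records, weighted):
    """Average household wealth per district.

    records: iterable of (district, wealth, household_weight) tuples.
    weighted: use household weights if True, a raw average otherwise.
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for district, wealth, weight in records:
        w = weight if weighted else 1.0
        numerator[district] += w * wealth
        denominator[district] += w
    return {d: numerator[d] / denominator[d] for d in numerator}

# Toy households in two districts.
households = [("A", 1.0, 3.0), ("A", -1.0, 1.0), ("B", 0.5, 2.0)]
weighted_means = district_means(households, weighted=True)   # census-style
raw_means = district_means(households, weighted=False)       # DHS-style
```

In district A the weighted mean is pulled toward the household with the larger weight, while the raw mean treats both households equally.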

We utilized asset wealth data from LSMS panel surveys for five countries (Malawi, Nigeria, Tanzania, Ethiopia, and Uganda). Cluster-level GPS coordinates are provided, with clusters in urban areas jittered up to 2 km and clusters in rural areas jittered up to 10 km. We are able to measure asset wealth for 9000 households over time in the LSMS data (roughly two orders of magnitude less than DHS), distributed over ~1400 clusters. As LSMS data follow households over time, we created a village-level panel using only households that existed in the first wave of interviews, removing any newly formed households or households that were not in later surveys. Additionally, where available, households that reported in the second survey that they had lived in their current location for less time than had elapsed since the first survey (i.e. migrant families) were removed. LSMS data were processed to match our DHS index as closely as possible, both by including the same assets and by using the same asset quality definitions wherever possible. The fridge and motorbike variables were not available in the LSMS data and were excluded from the LSMS wealth index. Using DHS data, we find that the original index and an index constructed excluding the fridge and motorbike variables were highly correlated, with an r² of 0.974. While we cannot directly compare DHS and LSMS indices at the village level, district-level estimates from the two sources have an r² of 0.60.

While our asset data cannot be used to directly construct poverty estimates (standard poverty measures are constructed from consumption expenditure data, which are not available in DHS surveys), household consumption aggregates are available in a subset of the LSMS data just described. Across six surveys in three countries, we find our constructed wealth index is fairly strongly correlated with log surveyed consumption at the village level, with a weighted r² of 0.50 (Supplementary Fig. S3). These results are consistent with findings that asset indices and consumption metrics are typically very comparable14, and suggest that our approach to wealth prediction could perhaps be useful for consumption prediction as well, particularly as additional consumption data become available to train deep learning models.

### Satellite imagery

We obtained Landsat surface reflectance and nighttime lights (nightlights) images centered on each cluster location, using the Landsat archives available on Google Earth Engine. We used 3-year median composite Landsat surface reflectance images of the African continent captured by the Landsat 5, Landsat 7, and Landsat 8 satellites. We chose three 3-year periods for compositing: 2009–11, 2012–14, and 2015–17. Each composite is created by taking the median of each cloud free pixel available during that period of 3 years. The motivation for using three-year composites was two-fold. First, multi-year median compositing has seen success in similar applications as a method to gather clear satellite imagery26, and even in 1-year compositing we continued to note the substantial influence of clouds in some regions, given imperfections in the cloud mask. Second, the outcome we are trying to predict (wealth) tends to evolve slowly over time, and we similarly wanted our inputs to not be distorted by seasonal or short-run variation. The images have a spatial resolution of 30 m/pixel with seven bands which we refer to as the multispectral (MS) bands: RED, GREEN, BLUE, NIR (Near Infrared), SWIR1 (Shortwave Infrared 1), SWIR2 (Shortwave Infrared 2), and TEMP1 (Thermal).
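The compositing step can be sketched outside of Google Earth Engine as a per-pixel median over time that ignores cloudy observations. This is an illustrative NumPy version under assumed array shapes, not the Earth Engine pipeline the authors used; the name `median_composite` and the boolean cloud-mask layout are hypothetical.

```python
import numpy as np

def median_composite(stack: np.ndarray, cloudy: np.ndarray) -> np.ndarray:
    """Per-pixel median over time, skipping observations flagged as cloudy.

    stack:  (T, H, W) reflectance values for one band over T acquisitions.
    cloudy: (T, H, W) boolean mask, True where the pixel is cloud-covered.
    """
    masked = np.where(cloudy, np.nan, stack.astype(float))
    # nanmedian ignores the masked (NaN) observations at each pixel.
    return np.nanmedian(masked, axis=0)

# Three acquisitions of a 1 x 2 image; the second pixel is cloudy at t = 0.
stack = np.array([[[0.1, 0.9]], [[0.2, 0.3]], [[0.3, 0.5]]])
cloudy = np.zeros_like(stack, dtype=bool)
cloudy[0, 0, 1] = True
composite = median_composite(stack, cloudy)
```

The bright cloudy value (0.9) is excluded, so the second pixel's composite is the median of its two clear observations. A pixel that is cloudy in every acquisition would come out as NaN in this sketch.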

For comparability, we also created 3-year median composites for our nightlights imagery. Because no single satellite captured nightlights for all of 2009–2016, we used DMSP27 for the 2009–11 composite, and VIIRS28 for the 2012–14 and 2015–17 composites. DMSP nightlights have 30 arc-second/pixel resolution and are unitless, whereas VIIRS nightlights have 15 arc-second/pixel resolution and units of nW cm⁻² sr⁻¹. The images are resized using nearest-neighbor upsampling to cover the same spatial area as the Landsat images. Because of the resolution difference and the incompatibility of their units, we treat the DMSP and VIIRS nightlights as separate image bands in our models.

Both MS and NL images were processed in and exported from Google Earth Engine29 in 255 × 255 tiles, then center-cropped to 224 × 224, the input size of our CNN architecture, spanning 6.72 km on each side (30 m Landsat pixel size × 224 px = 6.72 km). Note that this means any survey cluster whose location coordinates are artificially displaced by more than 4.75 km ($$6.72/\sqrt{2}$$) is completely beyond the spatial extent of the satellite imagery. Each band is normalized to have mean 0 and standard deviation 1 across our entire dataset. The raster of wealth in Nigeria in Fig. 5 was generated by exporting non-overlapping tiles from Google Earth Engine, following the same processing steps as for model training.
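The tile geometry above can be checked with a few lines of arithmetic. The sketch below center-crops a 255 × 255 tile to 224 × 224 and computes the tile side and the centre-to-corner distance; the seven-band toy tile and the choice of which side loses the extra pixel (255 − 224 is odd) are assumptions of this example.

```python
import math
import numpy as np

def center_crop(tile: np.ndarray, size: int = 224) -> np.ndarray:
    """Crop the central size x size window from an (H, W, C) tile."""
    height, width = tile.shape[:2]
    top, left = (height - size) // 2, (width - size) // 2
    return tile[top:top + size, left:left + size]

tile = np.zeros((255, 255, 7))        # toy tile with 7 bands
cropped = center_crop(tile)

side_km = 224 * 30 / 1000             # 30 m pixels -> 6.72 km per side
corner_km = side_km / math.sqrt(2)    # centre-to-corner distance, about 4.75 km
```

A cluster displaced by more than `corner_km` from the tile centre cannot fall inside the image in any direction, which is the bound quoted in the text.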

### Deep learning models

Our deep CNN models use the ResNet-18 architecture (v2, with preactivation)30, chosen for its balance of compactness and high accuracy on the ImageNet image classification challenge31. We modify the first convolutional layer to accommodate multi-band satellite images, and we modify the final layer to output a scalar for regression. For predicting changes in wealth and the “index of differences” on the LSMS data, we stack together the images from two different years to create a 224 × 224 × (2C) image, where C is the number of channels in a single satellite image.

The modifications to the first convolutional layer prevent direct initialization from weights pre-trained on ImageNet. Instead, we adopt the same-scaled initialization procedure32: weights for the RGB channels are initialized to values pre-trained on ImageNet, whereas weights for the non-RGB channels in the first convolutional layer are initialized to the mean of the weights from the RGB channels. Then all of these weights are scaled by 3/C where C is the number of channels. The remaining layers of the ResNet are initialized to their ImageNet values, and the weights for the final layer are initialized randomly from a standard normal distribution truncated at  ±2. For the models trained only on the nightlights bands, we initialized the first layer weights randomly using He initialization33. When predicting changes in wealth and when predicting the index of differences on the LSMS data, we used random initialization instead, as it performed better than using same-scaled ImageNet initialization on the validation sets (see “Cross-Validation”).
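A minimal sketch of the same-scaled initialization described above, written with NumPy rather than a deep learning framework. The weight layout `(out_channels, in_channels, k, k)` and the assumption that the RGB bands occupy the first three input channels are illustrative choices, not details confirmed by the text.

```python
import numpy as np

def same_scaled_init(rgb_weights: np.ndarray, num_channels: int,
                     rgb_idx=(0, 1, 2)) -> np.ndarray:
    """Expand pre-trained RGB first-layer kernels to num_channels inputs.

    rgb_weights: (out, 3, k, k) kernels pre-trained on RGB images.
    Returns:     (out, num_channels, k, k) kernels scaled by 3 / num_channels.
    """
    # Non-RGB channels start from the mean of the RGB kernels.
    mean_rgb = rgb_weights.mean(axis=1, keepdims=True)
    weights = np.repeat(mean_rgb, num_channels, axis=1)
    # RGB channels keep their pre-trained kernels (assumed channel positions).
    weights[:, list(rgb_idx)] = rgb_weights
    # Scale all channels by 3 / C.
    return weights * (3.0 / num_channels)

pretrained = np.random.default_rng(0).normal(size=(64, 3, 7, 7))  # stand-in weights
expanded = same_scaled_init(pretrained, num_channels=7)
```

One consequence of the 3/C scaling is that the kernels summed over input channels equal the original RGB kernels summed over their three channels, so the first layer's response to an input that is constant across bands is unchanged by adding channels.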

The ResNet-18 models are trained with the Adam optimizer34 and a mean squared-error loss function. The batch size is 64 and the learning rate is decayed by a factor of 0.96 after each epoch. The models are trained for 150 epochs (200 epochs for DHS out-of-country). The model with the highest r² on the validation set across all epochs is used as the final model for comparison. This is done as a regularization technique, equivalent to early stopping. We performed a grid search over the learning rate (1e-2, 1e-3, 1e-4, 1e-5) and L2 weight regularization (1e-0, 1e-1, 1e-2, 1e-3) hyperparameters to find the model that performs the best on the validation fold. To prevent overfitting, the images are augmented by random horizontal and vertical flips. The non-nightlights bands are also subject to random adjustments to brightness (up to 0.5 standard deviation change) and contrast (up to 25% change). Additionally, for predicting changes in wealth and the index of differences on the LSMS data, we randomize the order for stacking the satellite images (i.e. stacking the before image on top of or below the after image), multiplying the label by −1 whenever the after image was stacked on top to signify a reversed order.
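Two pieces of this training recipe lend themselves to a short sketch: the per-epoch learning-rate decay and the random stacking order with label negation. The function names are hypothetical, and treating "on top" as "first along the channel axis" is an assumption of this example.

```python
import numpy as np

def learning_rate(initial_lr: float, epoch: int, decay: float = 0.96) -> float:
    """Learning rate after `epoch` completed epochs of multiplicative decay."""
    return initial_lr * decay ** epoch

def stack_pair(before: np.ndarray, after: np.ndarray, label: float,
               rng: np.random.Generator):
    """Stack two (H, W, C) images along channels in random order.

    The label (change in wealth) is negated when the order is reversed.
    """
    if rng.random() < 0.5:
        return np.concatenate([before, after], axis=-1), label
    return np.concatenate([after, before], axis=-1), -label

rng = np.random.default_rng(0)
image_before = np.zeros((224, 224, 3))
image_after = np.ones((224, 224, 3))
stacked, signed_label = stack_pair(image_before, image_after, 0.7, rng)
```

Negating the label keeps the target consistent with the image order, so the model is asked to learn a function that is antisymmetric in its two inputs.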

When using the two nightlights bands, we set pixels in the non-present band to all zeros. This ensures that the first-layer weights for that band are not updated during back-propagation, because the gradient of the loss with respect to the weights for the all-zero band becomes zero. Furthermore, since the ResNet-18 architecture has a batch-normalization layer following each convolutional layer, there are no bias terms.
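The zero-gradient argument can be verified on a toy linear "1 × 1 convolution" with a squared-error loss. This is a hand-computed gradient on made-up numbers, intended only to illustrate the reasoning; it is not a ResNet.

```python
# Each pixel has two nightlights bands; the second band is absent (all zeros).
pixels = [(0.5, 0.0), (-1.2, 0.0), (0.8, 0.0)]
weights = [0.3, -0.7]
target = 1.0

# Prediction: mean over pixels of the weighted sum of bands (no bias term).
pred = sum(weights[0] * a + weights[1] * b for a, b in pixels) / len(pixels)

# Gradient of (pred - target)^2 with respect to each weight:
# 2 * (pred - target) * mean over pixels of that band's values.
grad = [
    2 * (pred - target) * sum(p[c] for p in pixels) / len(pixels)
    for c in range(2)
]
```

Because every input value in the absent band is zero, its weight receives a zero gradient from the loss regardless of the prediction error, while the present band's weight is updated as usual.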

For models incorporating both Landsat and nightlights (i.e. our combined model), we trained two ResNet-18 models separately on the Landsat bands and nightlights bands, respectively, and joined the models in their final fully connected layer. In other words, we concatenated the final layers of the separate Landsat and nightlights models and trained a ridge-regression model on top. We found that this approach performed better than stacking the nightlights and Landsat bands together in a single model.
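A sketch of the joining step, assuming feature vectors have already been extracted from the two trained networks. The random 8-dimensional features stand in for real final-layer features, and the closed-form ridge solver with an unpenalized intercept is one common formulation, not necessarily the one the authors used.

```python
import numpy as np

def ridge_fit(features: np.ndarray, targets: np.ndarray, alpha: float):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = features.mean(axis=0), targets.mean()
    xc, yc = features - x_mean, targets - y_mean
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    coef = np.linalg.solve(gram, xc.T @ yc)
    return coef, y_mean - x_mean @ coef

rng = np.random.default_rng(0)
ms_features = rng.normal(size=(200, 8))   # stand-in for Landsat model features
nl_features = rng.normal(size=(200, 8))   # stand-in for nightlights model features
wealth = (ms_features[:, 0] + 0.5 * nl_features[:, 1]
          + rng.normal(scale=0.1, size=200))

# Concatenate the two feature sets and fit a single ridge model on top.
combined = np.concatenate([ms_features, nl_features], axis=1)
coef, intercept = ridge_fit(combined, wealth, alpha=1.0)
predicted = combined @ coef + intercept
```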

For DHS data, an average of 25.59 households (standard deviation = 5.59) were surveyed for each village, compared to an average of 6.37 households (sd = 3.57) in LSMS. Due to the lower number of households surveyed for LSMS, which results in noisier estimates of village-level wealth, we weighted LSMS villages proportional to their surveyed household count in the loss function during training. We did not weight DHS villages.
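The household-count weighting can be written as a weighted mean squared error. Normalizing by the total household count is an assumption of this sketch; the text states only that villages are weighted in proportion to their surveyed household count.

```python
def weighted_mse(predictions, labels, household_counts):
    """Mean squared error with each village weighted by its household count."""
    total = sum(household_counts)
    return sum(
        n * (p - y) ** 2
        for p, y, n in zip(predictions, labels, household_counts)
    ) / total

# Two villages: the first (3 households) has error 1, the second (1 household) has none.
loss = weighted_mse([1.0, 0.0], [0.0, 0.0], [3, 1])
```

With equal counts this reduces to the ordinary mean squared error, which corresponds to the unweighted DHS case.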

### Transfer learning models

We compared our end-to-end training procedure with the transfer learning approach first proposed by Jean et al.8. In this approach, nightlights are a noisy but globally available proxy for economic activity (r² ≈ 0.3 with asset wealth), and a model is trained to predict nighttime lights values from daytime multispectral imagery. This process summarizes high-dimensional input daytime satellite images as lower-dimensional feature vectors that can then be used in a regularized regression to predict wealth.

Because our images have a mixture of DMSP and VIIRS values, and the two satellites have different spatial resolutions, the binning approach in Jean et al.8 that treated nightlights prediction as a classification problem was unworkable. Instead, we framed transfer learning as a multitask regression problem. We extracted the neural network’s final layer output predictions for both the DMSP value and the VIIRS value, and regressed on whichever nightlights label was available for each daytime image. On the nightlights prediction task over locations sampled from all 23 DHS countries, our transfer learning models achieved performance of r² = 0.82 when using RGB bands and r² = 0.90 when using all Landsat bands; these values are not directly comparable to results in Jean et al.8, as that work posed nightlights prediction as a 3-class classification problem. With these models trained to predict nightlights values from daytime imagery, we froze the model weights and fine-tuned the final fully connected layer to predict the wealth index. We note that our transfer learning experiments contain a much larger set of countries than the Jean et al.8 results, which focused on five countries, and thus are not directly comparable.
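The multitask regression loss described above can be sketched as follows: the network emits two outputs per image, and only the output matching the available nightlights label contributes to the loss. The tuple layout `(dmsp_prediction, viirs_prediction)` and the sensor codes 0 and 1 are hypothetical conventions for this example.

```python
def multitask_mse(outputs, labels, sensors):
    """Squared error against whichever nightlights label is available.

    outputs: list of (dmsp_prediction, viirs_prediction) per image.
    labels:  list of observed nightlights values, one per image.
    sensors: list of 0 (label is DMSP) or 1 (label is VIIRS) per image.
    """
    errors = [(out[s] - y) ** 2 for out, y, s in zip(outputs, labels, sensors)]
    return sum(errors) / len(errors)

# First image has a DMSP label, second has a VIIRS label.
loss = multitask_mse([(1.0, 5.0), (2.0, 3.0)], [1.0, 4.0], [0, 1])
```

The unused output for each image contributes nothing to the loss, so the two sensors are never compared in each other's units.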

### Baseline models

We train simpler k-nearest neighbor models (KNN) on nightlights that predict wealth in a given location i as the average wealth over the k locations with nightlights values closest to that in i. In essence, this model allows a non-linear and non-monotonic mapping of nightlights to wealth. The hyperparameter k is tuned by cross-validation. We also train a regularized linear regression on scalar nightlights (scalar NL) as a baseline model.
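The KNN baseline is simple enough to write out directly. This sketch uses plain Python with made-up numbers; the tie-breaking rule (stable sort order) is an implementation detail of the example.

```python
def knn_predict(train_nl, train_wealth, query_nl, k):
    """Average wealth of the k training locations closest in nightlights value."""
    order = sorted(range(len(train_nl)), key=lambda j: abs(train_nl[j] - query_nl))
    nearest = order[:k]
    return sum(train_wealth[j] for j in nearest) / k

train_nl = [0.0, 1.0, 2.0, 10.0]
train_wealth = [-1.0, 0.0, 1.0, 3.0]
prediction = knn_predict(train_nl, train_wealth, query_nl=1.2, k=2)
```

Because the prediction is a local average, nothing forces the fitted relationship between nightlights and wealth to be linear or monotonic.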

### Training on limited data

To evaluate how models perform in even more data-limited situations, we trained our deep models on random subsets of 5%, 10%, 25%, 50%, and 100% of the full training data, repeated over 3 trials with different random subsets. For each subset size, we report the mean r² over the three trials (Fig. 3c).

### Data splits

For both DHS and LSMS survey data, we split the data into 5 folds of roughly equal size for cross-validation. For the DHS out-of-country tests, we manually split the 23 countries into the 5 folds such that each fold had roughly the same number of villages, ranging from 3909 to 3963 (Supplementary Table S2). As described below, models were trained using cross-validation to select optimal hyperparameters. Each model was trained on three folds, validated on a fourth, and tested on a fifth. The fold splits used in the cross-validation procedure are shown in Supplementary Table S3. For DHS in-country training, we split the 19,669 villages into 5 folds such that there was no overlap in satellite images of the villages between any fold, where overlap is defined as any area (however small) that is present in both images. We used the DBSCAN algorithm to group together villages with overlapping satellite images, sorted the groups by the number of villages per group in decreasing order, then greedily assigned each group to the fold with the fewest villages. We followed the same procedure to create 5 LSMS in-country folds. We did not perform out-of-country tests with LSMS data.
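The greedy assignment step can be sketched as below, taking the groups of overlapping villages as given (the DBSCAN clustering itself is omitted). The group identifiers and sizes are invented for illustration.

```python
def assign_folds(group_sizes, num_folds=5):
    """Greedily assign groups of villages to folds, largest group first.

    group_sizes: dict mapping group id -> number of villages in the group.
    Returns (assignment dict mapping group id -> fold index, fold totals).
    """
    totals = [0] * num_folds
    assignment = {}
    # Largest groups first; each goes to the fold that currently has the fewest villages.
    for group, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        fold = totals.index(min(totals))
        assignment[group] = fold
        totals[fold] += size
    return assignment, totals

sizes = {"a": 5, "b": 4, "c": 3, "d": 3, "e": 2, "f": 1}
assignment, totals = assign_folds(sizes, num_folds=3)
```

Keeping every group inside a single fold is what guarantees that no two folds share overlapping imagery; the greedy ordering only serves to balance fold sizes.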

### Cross-validation

For each of the input band combinations (MS, MS+NL, NL), we trained five separate models, each with a different test fold. Of the four remaining folds, three folds were used to train the models, with the final fold designated as the validation set used for early stopping and tuning other hyperparameters (Supplementary Table S3). Once the CNNs were trained, we fine-tuned the last fully connected layer using ridge regression with leave-one-group-out cross-validation. In the out-of-country setting, we fine-tuned the final layer individually for each test country, using data from all other countries. Thus, the convolutional layers in the CNNs have effectively seen data from four of the five folds, while the final layer sees data from every country except the test country. In the in-country setting, we only used data from the non-test folds for fine-tuning.

Ideally, the hyperparameters for machine learning models should be tuned by cross-validation for optimal generalization performance on unseen data. However, because training deep neural networks requires substantial computational resources, leave-one-group-out cross-validation is prohibitively time intensive (where in our setting, each group is a country). Consequently, we performed leave-one-fold-out cross-validation for all the hyperparameters for the body of the CNN, and only used leave-one-group-out cross-validation to tune the regularization parameter for training the weights in the final fully connected layer.
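As a rough sketch of the final-layer tuning, the code below selects a ridge regularization strength by leave-one-group-out cross-validation on pre-extracted features. The synthetic features, the group labels, the candidate `alphas`, and the use of mean squared error as the selection criterion are all assumptions of this example.

```python
import numpy as np

def ridge_predict(x_train, y_train, x_test, alpha):
    """Fit closed-form ridge (unpenalized intercept) and predict on x_test."""
    x_mean, y_mean = x_train.mean(axis=0), y_train.mean()
    xc = x_train - x_mean
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    coef = np.linalg.solve(gram, xc.T @ (y_train - y_mean))
    return (x_test - x_mean) @ coef + y_mean

def logo_select_alpha(x, y, groups, alphas):
    """Return the alpha with the lowest leave-one-group-out squared error."""
    errors = []
    for alpha in alphas:
        squared_error = 0.0
        for g in np.unique(groups):
            held_out = groups == g   # e.g. all clusters from one country
            pred = ridge_predict(x[~held_out], y[~held_out], x[held_out], alpha)
            squared_error += np.sum((pred - y[held_out]) ** 2)
        errors.append(squared_error / len(y))
    return alphas[int(np.argmin(errors))], errors

rng = np.random.default_rng(0)
x = rng.normal(size=(120, 5))
y = x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=120)
groups = np.repeat(np.arange(6), 20)      # six hypothetical countries
best_alpha, cv_errors = logo_select_alpha(x, y, groups, alphas=[1e-3, 1e3])
```

Only a linear solve is repeated per held-out group, which is why this form of cross-validation is affordable for the final layer but not for the CNN body.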

### Comparison with previous benchmarks

Our model achieves a cross-validated r² = 0.67 on pooled cluster-level observations in held-out countries (or r² = 0.70 when averaging over r² values from each country). This meets or exceeds published performance on related tasks, including using high-resolution imagery and transfer learning to predict asset wealth in five African countries8 (r² = 0.56), using call detail records to predict asset wealth in Rwanda17 (r² = 0.62), and using survey data and geospatial covariates to predict housing quality5 (r² = 0.67), child stunting1 (r² = 0.49), and diarrheal incidence2 (r² = 0.47 averaged over years) across sub-Saharan Africa, or to predict standard of living in Senegal18 (r² = 0.69). All values are for published cross-validated performance at the cluster or pixel level (except for diarrheal incidence, whose performance is only reported at the admin-2 level).
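The distinction between the pooled figure and the country-averaged figure can be made concrete with a toy example. The sketch assumes the reported statistic is the squared Pearson correlation between predictions and ground truth, and the two "countries" below are invented to exaggerate the difference.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Predictions are perfect within each country, but country B's are offset by +5.
truth = {"A": [0.0, 1.0, 2.0], "B": [0.0, 1.0, 2.0]}
predicted = {"A": [0.0, 1.0, 2.0], "B": [5.0, 6.0, 7.0]}

per_country = [r_squared(truth[c], predicted[c]) for c in truth]
average_r2 = sum(per_country) / len(per_country)
pooled_r2 = r_squared(truth["A"] + truth["B"], predicted["A"] + predicted["B"])
```

Averaging per-country values rewards getting within-country rankings right, whereas the pooled value also penalizes errors in between-country levels, so the two need not agree.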

As our primary focus is on constructing and evaluating out-of-country predictions, our results are not directly comparable to findings from other small area estimate approaches that rely on having in-country surveys with which to extract covariates and make local-level predictions (e.g. refs. 35,36). However, our satellite-derived wealth estimates and/or the satellite-derived features themselves could be used as input to these small area estimates, and evaluating the utility of satellite-derived data in such settings is a promising avenue for future research.

### Research and policy experiments

To study whether our satellite-based estimates can be used to shed light on the determinants of the spatial distribution of wealth (a longstanding research question), we match our ground-based and satellite-based wealth estimates to gridded data on maximum temperature in the warmest month21. We study temperature as our potential wealth determinant because past work has suggested that differences in temperature exert significant, non-linear influence on economic output19,20, and because temperature data are readily available for all our study locations.

We extract the maximum average monthly temperature for each cluster in our dataset (averaged over the years 1970–2000; ref. 21) and then flexibly regress wealth estimates on temperature:

$${w}_{i}=f({T}_{i})+{\varepsilon }_{i} \tag{1}$$

where $${w}_{i}$$ is the wealth estimate for cluster i and $$f({T}_{i})$$ is a fourth-order polynomial in temperature. To capture uncertainty in our estimates of $$f(\cdot )$$, we bootstrap Eq. (1) 100 times for each different wealth measure, sampling villages with replacement. We compare estimates of $$f(\cdot )$$ when we measure $${w}_{i}$$ using the ground data or when using various satellite-based estimates: our benchmark MS+NL estimates, or the two main other published approaches, CNN transfer learning8 and scalar nightlights7. Results are shown in Fig. 5a. We emphasize that these cross-sectional estimates of $$f(\cdot )$$ do not represent causal estimates of the impact of temperature on wealth, as many other factors are known to be correlated with both temperature and wealth (e.g. institutional quality, disease environment, nearby trading partners, etc.)22.
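A minimal sketch of the bootstrapped polynomial fit in Eq. (1), using synthetic data. Centering temperature before fitting (for numerical stability), the 50-point evaluation grid, and the 2.5–97.5 percentile band are choices made for this example and are not taken from the text.

```python
import numpy as np

def bootstrap_poly(temp, wealth, degree=4, n_boot=100, seed=0):
    """Fit a polynomial in temperature on bootstrap resamples of villages.

    Returns the evaluation grid and an (n_boot, grid size) array of fitted curves.
    """
    rng = np.random.default_rng(seed)
    center = temp.mean()                       # centering improves conditioning
    grid = np.linspace(temp.min(), temp.max(), 50)
    curves = np.empty((n_boot, grid.size))
    for b in range(n_boot):
        # Sample villages with replacement.
        idx = rng.integers(0, len(temp), size=len(temp))
        coefs = np.polyfit(temp[idx] - center, wealth[idx], degree)
        curves[b] = np.polyval(coefs, grid - center)
    return grid, curves

# Synthetic villages with a concave wealth-temperature relationship.
rng = np.random.default_rng(1)
temp = rng.uniform(20, 40, size=500)
wealth = -0.01 * (temp - 25) ** 2 + rng.normal(scale=0.1, size=500)
grid, curves = bootstrap_poly(temp, wealth)
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
```

Running the same procedure once with ground-measured wealth and once with each satellite-based estimate gives curves that can be compared directly on a common grid.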

To study whether our satellite-based estimates can be used for policy tasks, we evaluate the hypothetical targeting of a social protection program (e.g. a cash transfer), in which all villages below some asset level receive the program and villages above that level do not. Such targeting on survey-derived asset data is a common approach to program disbursement in developing countries23. Because asset indices constitute a relative measure of wealth and it is not obvious how to set an absolute cut-off to define who is poor, standard practice is instead to divide the population into percentiles in the asset distribution and then designate bottom percentiles as poor15.

We follow that practice here. Using the ground data, we define a threshold $${w}_{p,g}^{* }$$ corresponding to a chosen percentile p in the ground-measured asset distribution, and designate any village with wealth below that threshold as a program beneficiary (a treated village), i.e. $${t}_{i,g,p}={\mathbb{1}}[{w}_{i,g}\, <\, {w}_{p,g}^{* }]$$, where $${w}_{i,g}$$ is village i’s measured wealth in the ground data and $${t}_{i,g,p}$$ denotes that village’s treatment status according to the ground data. We then follow the same procedure for a satellite-estimated wealth distribution s, choosing the same percentile p in the satellite-estimated distribution to define treatment. This yields each village’s treatment status under the satellite-derived estimates $${t}_{i,s,p}={\mathbb{1}}[{w}_{i,s}\, <\, {w}_{p,s}^{* }]$$. We note that we are fixing p between ground- and satellite-based estimates rather than fixing the wealth threshold, such that the same overall number of villages is treated in both the ground-measured case and the satellite-measured case.

Under the assumption that the ground-derived treatment statuses $${t}_{i,g,p}$$ are correct, we then define targeting accuracy $${A}_{s,p}$$ as the proportion of satellite-derived treatment statuses that are correct under a given percentile cutoff p, i.e. $${A}_{s,p}=\frac{1}{n}\mathop{\sum }\nolimits_{i = 1}^{n}{\mathbb{1}}[{t}_{i,s,p}={t}_{i,g,p}]$$, where n is the total number of villages. We compute $${A}_{s,p}$$ under different values of p ranging from the 10th to the 50th percentile, and for the same three different satellite-based wealth estimates s (MS+NL, transfer learning, and scalar NL) used in Fig. 5. We emphasize that to the extent that the ground data $${w}_{i,g}$$ are measured with noise, which we have strong evidence of (see Supplementary Fig. S12 and Fig. 2e, f), our estimated targeting accuracy likely understates true targeting accuracy.
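The targeting accuracy defined above can be computed in a few lines. In this sketch the percentile threshold is taken as the k-th smallest value with k = ⌊np/100⌋, which is one simple convention and an assumption of the example; it treats exactly k villages when there are no ties and requires p < 100.

```python
def treated(wealth, percentile):
    """Treatment indicator: True for villages below the percentile threshold."""
    k = int(len(wealth) * percentile / 100)
    threshold = sorted(wealth)[k]
    return [w < threshold for w in wealth]

def targeting_accuracy(ground, satellite, percentile):
    """Share of villages whose satellite-based status matches the ground-based one."""
    t_ground = treated(ground, percentile)
    t_satellite = treated(satellite, percentile)
    return sum(a == b for a, b in zip(t_ground, t_satellite)) / len(ground)

# Ten villages; the satellite estimate swaps the poorest and richest village.
ground = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
satellite = [9.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 0.0]
accuracy = targeting_accuracy(ground, satellite, percentile=20)
```

Because the same percentile is applied to both distributions, each misclassified poor village is matched by a misclassified non-poor village, so errors come in pairs as in this example.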

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.