Decay radius of climate decision for solar panels in the city of Fresno, USA

To design incentives towards achieving climate mitigation targets, it is important to understand the mechanisms that affect individual climate decisions such as solar panel installation. It has been shown that peer effects are important in determining the uptake and spread of household photovoltaic installations. Due to coarse geographical data, it remains unclear whether this effect is generated through geographical proximity or within groups exhibiting similar characteristics. Here we show that geographical proximity is the most important predictor of solar panel implementation, and that peer effects diminish with distance. Using satellite imagery, we build a unique geo-located dataset for the city of Fresno to specify the importance of small distances. Employing machine learning techniques, we find the density of solar panels within the shortest measured radius of an address is the most important factor in determining the likelihood of that address having a solar panel. The importance of geographical proximity decreases with distance following an exponential curve with a decay radius of 210 meters. The dependence is slightly more pronounced in low-income groups. These findings support the model of distance-related social diffusion, and suggest priority should be given to seeding panels in areas where few exist.

3. Feature importance scores by variable, for the main AdaBoost model re-calculated using all density variable radii.
4. An evaluation of all tested models without the inclusion of panel density variables.
7. Presentation of feature importances without the subtraction of the previous radius.
8. An analysis of the decay in panel density importance over increasing radii when the data is subset by the number of households in the census tract and tract area.
9. Presentation of p-values calculated for the variables in Figure 3.
10. The distribution of household income in the dataset, by which the data is subset.

11. An analysis of normalized panel density importance when data is subgrouped by median home value.
12. An analysis of normalized panel density importance when data is subgrouped by income.
13. Descriptive statistics for census tract level features.
14. Presentation of the correlation score and direction for all normalized panel density variables with the outcome.

Random Forest
Although it demonstrated the best overall performance metrics, the Random Forest model was discarded: it did well in identifying addresses without panels, but this performance on the majority class came at the expense of the minority class. Compared with the AdaBoost model, it misclassified a larger percentage of houses with panels.

XGBoost
The XGBoost algorithm was discarded for demonstrating the poorest performance metrics of the three algorithms tested, as shown in the following figures and tables.

Supplementary Discussion 2
We provide the confusion matrices along with overall performance metrics (Total Accuracy, the Area Under the Receiver Operating Characteristic Curve, and the Area Under the Precision-Recall Curve) for all models built without the density variables. Across all models, performance suffers significantly, especially when compared with the models that include density calculated at the smallest radius. This further indicates the importance of the density variables for accurately predicting the existence of panels at a particular address.

Figure S7. Confusion matrices for all models run without density variables, showing the number of correctly and incorrectly classified addresses of each type (Panel and No Panel). All confusion matrices are computed for a decision threshold of 0.5.
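To make the construction of these matrices concrete, the following is a minimal sketch of how a confusion matrix and per-class recall could be derived from predicted probabilities at a decision threshold of 0.5. The function names, the toy labels, and the choice to treat a probability of exactly 0.5 as "Panel" are assumptions made for illustration; they are not taken from the original analysis.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count outcomes for a binary Panel (1) / No Panel (0) classifier.

    Assumption: a probability equal to the threshold is classified as Panel.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            counts["tp"] += 1  # panel correctly identified
        elif predicted == 1 and truth == 0:
            counts["fp"] += 1  # no-panel address flagged as panel
        elif predicted == 0 and truth == 0:
            counts["tn"] += 1  # no-panel address correctly identified
        else:
            counts["fn"] += 1  # panel missed
    return counts


def summarize(counts):
    """Total accuracy plus recall for each class."""
    total = sum(counts.values())
    panel = counts["tp"] + counts["fn"]
    no_panel = counts["tn"] + counts["fp"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "panel_recall": counts["tp"] / panel if panel else float("nan"),
        "no_panel_recall": counts["tn"] / no_panel if no_panel else float("nan"),
    }


# Toy, made-up labels and scores (not data from the study).
y_true = [1, 0, 0, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.1, 0.4, 0.3, 0.05, 0.7]
counts = confusion_matrix(y_true, y_prob)
metrics = summarize(counts)
```

Reporting recall separately for the Panel and No Panel classes shows why total accuracy alone can be misleading when addresses with panels are the minority class.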
Section 5. A robustness check in which OLS models are built with normalized panel density radii of 200m, 500m, and 1000m.

Section 6. A comparison of feature importances when panel density is averaged over census tract.
Most of the economic, social, and demographic variables are calculated on the census tract level, whereas the panel density surrounding an address is calculated within variable radii around an address. Therefore, the granularity of these tract-level variables is coarser than that of the density variable. To test for the effect of this granularity difference, we run the model using panel density averaged over the census tract, and find density is still the most important feature. The average tract size is 2.07 square miles. We build separate models using tract-averaged density variables calculated at all radii. The following figure shows the feature importance score for each of these tract-averaged density variables, averaged over 50 permutations. We see they have very close importance scores, as would be expected, given the difference in these variables is on the margins of each tract area.
Figure S9. The feature importance score for each of the tract-averaged density variables.
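As an illustration of the tract-averaging step, the sketch below replaces each address-level density with the mean over all addresses in the same census tract. The record layout and the names `tract_id`, `density_200m`, and `tract_average` are hypothetical and chosen only for this example; the values are made up.

```python
from collections import defaultdict


def tract_average(records, density_key="density_200m", tract_key="tract_id"):
    """Attach the tract-level mean of an address-level density to every record."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec[tract_key]] += rec[density_key]
        counts[rec[tract_key]] += 1
    averaged = []
    for rec in records:
        tract = rec[tract_key]
        new_rec = dict(rec)  # copy so the input records are left unchanged
        new_rec[density_key + "_tract_avg"] = sums[tract] / counts[tract]
        averaged.append(new_rec)
    return averaged


# Toy, made-up addresses (hypothetical field names and values).
addresses = [
    {"tract_id": "A", "density_200m": 0.10},
    {"tract_id": "A", "density_200m": 0.30},
    {"tract_id": "B", "density_200m": 0.00},
]
averaged = tract_average(addresses)
```

After this step every address in a tract carries the same density value, which puts the density variable on the same granularity as the tract-level socioeconomic variables.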

Section 7. An evaluation of feature importances without the subtraction of the previous radius.
The following figure provides the importance score for all density variables (200m to 1200m) for the estimated AdaBoost model. For these variables, the normalized panel density at the previous radius is not subtracted.
Figure S10. The feature importance scores for density variables calculated at various radii where the normalized panel density at the previous radius has not been subtracted.
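The distinction between the two versions of the density variable can be sketched as follows. The cumulative version measures density within a radius; the incremental version subtracts the density at the previous (smaller) radius. The normalization used here (share of neighbouring addresses within the radius that have a panel), the planar coordinates in metres, and all function names are assumptions for illustration only, since the exact normalization is not restated in this section.

```python
import math


def density_within(target_xy, neighbours, radius_m):
    """Share of neighbouring addresses within radius_m that have a panel.

    `neighbours` is a list of (x, y, has_panel) tuples in projected metres.
    Normalizing by the number of neighbouring addresses is an assumption.
    """
    tx, ty = target_xy
    inside = [
        has_panel
        for x, y, has_panel in neighbours
        if math.hypot(x - tx, y - ty) <= radius_m
    ]
    return sum(inside) / len(inside) if inside else 0.0


def cumulative_densities(target_xy, neighbours, radii_m):
    """Density at each radius, without subtracting the previous radius."""
    return {r: density_within(target_xy, neighbours, r) for r in sorted(radii_m)}


def incremental_densities(target_xy, neighbours, radii_m):
    """Density at each radius minus the density at the previous radius."""
    features = {}
    previous = 0.0
    for radius, cumulative in cumulative_densities(target_xy, neighbours, radii_m).items():
        features[radius] = cumulative - previous
        previous = cumulative
    return features


# Toy, made-up neighbourhood around an address at the origin.
neighbours = [
    (100.0, 0.0, True),
    (150.0, 0.0, False),
    (300.0, 0.0, True),
    (350.0, 0.0, True),
]
cumulative = cumulative_densities((0.0, 0.0), neighbours, [200, 400])
incremental = incremental_densities((0.0, 0.0), neighbours, [200, 400])
```

Because each cumulative density contains the contribution of every smaller radius, the cumulative variables are strongly correlated with one another; the incremental form isolates what each additional radius adds.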

Section 8. An analysis of the decay in panel density importance over increasing radii when the data is subset by the number of households in the census tract and tract area.
We further explore whether the relationship we observe between feature importance and panel density calculation radius is driven by the difference in data granularity between the density and socioeconomic variables. To this end, we subdivide our data by both the number of households in each tract and tract area (Supplementary Figure S11). We first calculate the feature importances for each of three bins based on the total number of households in the tract, defining a "small" tract as having 1,000 households or fewer (panel a), "medium" as having between 1,000 and 1,800 households (panel b), and "large" as having more than 1,800 households (panel c). Supplementary Figure S12 shows that the exponential decay curve we find for the overall dataset still provides a good representation of the decay of feature importance with increasing density calculation radius across these subsets.
Figure S12. Feature importance scores, multiplied by 100, for each of the panel density variables calculated for models in which the data is split by census tracts with fewer than 1,000 total households, between 1,000 and 1,800 total households, and more than 1,800 total households.
Second, we bin our data over the area of the census tract (Supplementary Figure S13). We define the "small" bin as having an area less than or equal to 1 square mile (panel a), "medium" as having between 1 and 3 square miles (panel b), and "large" as above 3 square miles (panel c). As can be seen from Supplementary Figure S14, the exponential decay curve with a radius of 210m we found for the overall dataset is still a very good fit for the sub-selection of data, even though the clarity of the signal is reduced in the smaller data sets, in particular for the "large" area grouping. This is likely due to the overall low number of panels (and therefore positive cases in the training data) in this subset.
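For readers who wish to reproduce this kind of fit, the sketch below estimates a decay radius from importance scores at several radii. It assumes the functional form importance = amplitude * exp(-radius / decay_radius) and uses ordinary least squares on the logarithm of the scores, which is a simple stand-in and not necessarily the fitting procedure used in the analysis. The input values are synthetic, generated from a known curve purely to check that the routine recovers it.

```python
import math


def fit_exponential_decay(radii_m, importances):
    """Fit importance = amplitude * exp(-radius / decay_radius).

    Uses least squares on log(importance), so all scores must be positive.
    Returns (amplitude, decay_radius).
    """
    xs = list(radii_m)
    ys = [math.log(value) for value in importances]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # equals -1 / decay_radius for a decaying curve
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -1.0 / slope


# Synthetic scores generated from a known curve (not results from the study).
radii = [200, 400, 600, 800, 1000, 1200]
synthetic_scores = [5.0 * math.exp(-r / 210.0) for r in radii]
amplitude, decay_radius = fit_exponential_decay(radii, synthetic_scores)
```

A log-linear fit weights small importance scores at large radii more heavily than a direct nonlinear fit would, so on noisy data the two approaches can give somewhat different decay radii.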
Figure S13. Distribution of the tract area variable for all census tracts.
Figure S14. Feature importance scores, multiplied by 100, for each panel density variable calculated for models in which the data is split by census tracts with a total area less than 1 square mile, between 1 and 3 square miles, and more than 3 square miles.

Table S10. P-values relevant to the variables included in Figure 3.

Section 10. Distribution of household income in the dataset, by which the data is subset.
Figure S15. Distribution of the median household income variable over all census tracts.

Section 11. Analysis of normalized panel density importance when data is subgrouped by median home value.
Figure S16. Feature importance of the normalized panel density variable when the dataset is split into low, medium, and high median home value groups. Low median home value is defined as less than $150,000, medium between $150,000 and $250,000, and high above $250,000. These groupings are based on the variable's distribution.

Section 12. Analysis of normalized panel density importance when data is subgrouped by income.
Figure S17. Feature importance of the normalized panel density variable when the dataset is split into low, medium, and high income brackets.

Section 13. Descriptive statistics for census tract level features.