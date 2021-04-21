In recent years, the growth of residential solar photovoltaic power generation systems, and of programs to spur their implementation, has led both to more data on installation patterns and to more study of the dynamics of their uptake. Peer effects on climate-related decisions have been identified as both passive, geographic effects (e.g. I see a panel close to my house) and active, social network effects (e.g. I hear about panels via word of mouth, or have similar tendencies to others with my socio-demographic or educational background), and both have been suggested as important motivators for installation1. Distinguishing between these types of effects is particularly difficult because previous studies have been conducted at highly aggregated spatial levels (ZIP codes or census tracts), where peer effects may be confounded by similar trends within neighbourhoods that share social, demographic, or economic qualities. Research at a finer geographic resolution will allow a better understanding of the distance within which a passive peer effect may be induced. We seek to answer two questions jointly: which type of peer effect (passive or active) is more effective, and what effect does distance have on the passive mechanism?

Closer physical proximity to solar panels has been identified as having a positive effect on implementation over a ZIP code or district, and across spaces larger than 1 km2,3,4,5. It has also been shown that highly localized diffusion is most influential on the likelihood of implementation, but less is known about diffusion dynamics within 1 km3,6. Previous qualitative research in Sweden has suggested that simply viewing a solar panel, or living in proximity to one, has negligible influence on the likelihood of others installing panels, and that peer effects are instead generated through existing social networks7,8.

Both passive and active effects have been identified as having significant effects on the uptake of alternative fuel vehicles and resource conservation behaviors9,10. Mildenberger et al. provide support for the active channel, finding that high levels of political activity are an important predictor of solar panel installation, regardless of political affiliation11. Identifying the mechanisms through which climate decisions such as solar panel installation or uptake of alternative fuel vehicles are made has the potential to support pro-climate decision making and help governments design effective policies at potentially lower cost.

Here, we aim to shed light on the mechanisms through which this peer effect occurs by analysing solar panel uptake at the level of actual households. This is made possible by a data set of geo-located solar panels for the city of Fresno, identified from satellite images12 (Fig. 1a). We match these data to geo-located address and school district data provided by the County of Fresno13 and the Institute of Education Sciences14, as well as to socio-economic and demographic variables at the census tract level from the American Community Survey15 (see “Data and methods” section for detail).

Figure 1 Geolocations of solar panels and addresses in Fresno included in the analysis. In panel (a), light grey boxes indicate the bounds of the aerial images in which the solar panel geolocations are marked. Each address is marked with a grey dot, and each panel geolocation is marked by a red dot. Geolocations of solar panels and bounding boxes of the aerial images from which these data were derived are taken from Bradbury et al. (2016); address data for the city of Fresno are made publicly available by the County of Fresno13. Panel (b) is a close-up illustrating the panel density calculation process: the red ‘x’ indicates the panel around which panel density is calculated, with several example radii shown in blue.

We employ feature importance analysis in conjunction with machine learning techniques to identify the most important predictor for solar panel installation, taking into account panel density within radii ranging from 200 m to 1200 m around a house. This allows us to identify the effect of proximity over space as well as any decline in effect size within this distance. We are also able to examine the effect of proximity to a solar panel regardless of the type of building on which it is installed. Given that our data include both proximity to a panel and a potential social network (school district), we are able to compare the potential effects on uptake through both of these mechanisms.

We employ tree-based machine learning techniques because they do not specify the functional form of the relationships between features and the outcome, and are therefore able to capture nonlinear and non-parameterized relationships. Tree algorithms also demonstrate better performance on non-linearly separable or highly correlated data compared to parameterized modelling methods16,17. This is especially important given the nature of solar panel installation data, in which the outcome classes (having a panel or not having a panel) are highly unbalanced: homes with installed panels are far fewer than homes without. This imbalance can be so extreme that buildings with installed photovoltaic systems make up less than 1% of all buildings in the data4. Machine learning methodologies, especially tree-based algorithms, have been found to handle these types of imbalanced class problems better and to consistently show better performance than statistical modelling techniques16,18.

Panel density most important predictor for solar panel implementation

Specifically, we train an AdaBoost classifier to predict the likelihood of a particular address having a solar panel based on the socio-economic, demographic, and solar panel density features associated with that address (an overview of all features included in the main model can be found in Supplementary Table S1 and a discussion of all classifiers and evaluation metrics considered in the “Data and methods” section). We consider densities ranging from 200 m to 1200 m (in 100 m increments) and construct a different model for each density measurement, to avoid introducing high collinearity between the density variables constructed at different radii. Panel density is normalized by address density. Both the density of panels around a particular address and the density of addresses around that address are calculated exclusive of that address and its possible panels. Excluding an address’ own panel and calculating the density using just those other panels within the chosen radius ensures we do not induce a data leakage problem. A visualization of the radii constructed around a panel is included in Fig. 1b.
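The density construction described above can be illustrated with a minimal sketch. It assumes projected coordinates in metres and a hypothetical data layout in which each panel records the index of the address it sits on (or None for a non-address building). The function name, the toy coordinates, and the convention of returning 0.0 when no neighbouring address exists are illustrative assumptions, not details taken from the analysis.

```python
import math

def normalized_panel_density(target_idx, addresses, panels, radius_m):
    """Panels per neighbouring address within radius_m of one address.

    addresses: list of (x, y) positions in metres (assumed projected coordinates).
    panels: list of (x, y, owner_idx); owner_idx is the index of the address
            the panel sits on, or None (hypothetical layout for illustration).
    """
    tx, ty = addresses[target_idx]

    def within(x, y):
        return math.hypot(x - tx, y - ty) <= radius_m

    # Count neighbouring addresses, excluding the target address itself.
    n_addresses = sum(
        1 for i, (x, y) in enumerate(addresses)
        if i != target_idx and within(x, y)
    )
    # Count neighbouring panels, excluding any panel on the target address,
    # so the feature cannot leak the outcome being predicted.
    n_panels = sum(
        1 for x, y, owner in panels
        if owner != target_idx and within(x, y)
    )
    # Assumed convention: an address with no neighbours gets density 0.0.
    return n_panels / n_addresses if n_addresses else 0.0

# Toy data (made up): four addresses, three panels.
addresses = [(0, 0), (100, 0), (0, 150), (500, 0)]
panels = [(0, 0, 0), (100, 0, 1), (510, 0, 3)]

# One density value per radius; each would feed a separate model.
densities = {r: normalized_panel_density(0, addresses, panels, r) for r in (200, 600)}
print(densities)
```

In this toy case the panel on address 0 is ignored when computing the density for address 0, which mirrors the exclusion rule used to avoid data leakage.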

We then examine the contribution of each feature to the accurate prediction of an address’ solar panel status (having or not having a panel) by calculating the permuted importance of each feature. That is, we compare the importance of each feature to others in the model based on their contribution to the model’s performance (performance defined here by the area under the precision-recall curve, AU P-R curve; see “Data and methods” section for detail). As the feature importance scores do not describe the direction of influence of a variable on the outcome (positive or negative), we check this direction by calculating the correlation between each variable and the outcome for all models. In addition to our primary evaluation metric, AU P-R curve, we also determine the total accuracy, the area under the receiver operating characteristic curve (AUC ROC), and the confusion matrices for all classifiers (Tables S2-S4, Figures S1-S6; see Supplementary Discussion 1, and “Data and methods” section for a detailed description). Based on these metrics, we determine our preferred model.
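The idea of permuted importance scored by the area under the precision-recall curve can be sketched as follows. This is a simplified illustration and not the pipeline used in the analysis: the area is approximated by step-wise average precision, the number of shuffles and the random seed are arbitrary assumptions, and `score_by_density` is a hypothetical stand-in for a trained classifier.

```python
import random

def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos if n_pos else 0.0

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in average precision when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = average_precision(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = average_precision(y, [predict(r) for r in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return baseline, importances

def score_by_density(row):
    # Hypothetical scorer: uses only feature 0 and ignores feature 1.
    return row[0]

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.9, 3.0], [0.8, 1.0], [0.7, 4.0], [0.6, 1.5], [0.5, 9.0],
     [0.4, 2.6], [0.3, 5.0], [0.2, 3.5], [0.1, 8.0], [0.05, 7.0]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

baseline, importances = permutation_importance(score_by_density, X, y)
print(baseline, importances)
```

Shuffling a feature the scorer ignores leaves performance unchanged, so its importance is zero, while shuffling the informative feature lowers the score. The sign of an importance value says nothing about the direction of the effect, which is why a separate correlation check is needed.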

In all models, normalized panel density at a particular address is the most important feature by a large margin, with a calculated p-value significant at the 1% level (Supplementary Table S5). It surpasses all economic (e.g. median household income, employment status), housing (e.g. home value, owner occupied homes), demographic (e.g. racial breakdown, median age), and network (school district) variables (Fig. 2). These results are robust for all radii at which density is calculated, with the feature increasing in importance as the radius is shortened. Secondarily important variables include median household income and median home value, as can be seen in the close-up panels of Fig. 2. In all models, we find that the normalized density is positively correlated with our outcome, confirming that solar panel density around an address is a positive predictor of an address having a solar panel installed (see “Data and methods” section for more detail). Model performance suffers across all models when density variables are removed (Supplementary Discussion 2; Supplementary Table S6 and Supplementary Figure S7). This further underlines the predictive power of the panel density variable for predicting the presence of a solar panel at a particular address. Applying a simple regression model further supports our main conclusion that proximal solar panel density influences the likelihood of installing a panel, and that this influence is positive (Supplementary Tables S7-S9). Panel density also remains the most important variable in all models when averaged over census tract (Supplementary Figures S8 and S9).
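The direction check mentioned above (importance scores carry no sign, so the sign is read from a correlation with the outcome) can be sketched as follows. The text does not state which correlation coefficient was used; Pearson correlation is assumed here, and the toy values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Toy data (made up): normalized panel density and panel status per address.
density = [0.0, 0.1, 0.05, 0.4, 0.5, 0.3]
has_panel = [0, 0, 0, 1, 1, 0]

r = pearson(density, has_panel)
direction = "positive" if r > 0 else "negative"
print(round(r, 3), direction)
```

With a binary outcome, this coefficient is the point-biserial correlation, so a positive value means addresses with panels tend to have higher surrounding panel density.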

Figure 2 Panel density most important feature for predicting solar panel installation. Feature importance scores are shown for every feature as the point value of performance contributed by that feature, multiplied by 100. The solar panel densities shown are calculated at 200 m, 500 m, and 1 km around each address in the dataset (see Supplementary Table S5 for results for all 100 m increments between 200 m and 1200 m). The top panel shows a bar plot of all features in the model; the lower panel shows a zoomed view of all features excluding the density feature for each model. Across all models, panel density consistently contributes the largest gains in performance. Feature importance is highest for a radius of 200 m and decreases with increasing radius.

Exponential decay of panel density importance with larger radii

Comparing the importance scores of density variables calculated at different radii revealed a larger influence of panels within shorter distances. To further explore how the importance of panel density decreases with distance, we next calculate the normalized density of panels to addresses at each radius with the panels and addresses from the previous radius subtracted. Feature importances for density variables calculated without the subtraction of the previous radius are provided as a robustness check and show qualitatively the same result (Supplementary Figure S10). Subtracting the previous radius’ density isolates the effect of just the area added relative to the previous model, which uses a shorter radius. The same set of socio-economic and demographic variables is included in all models, regardless of the radius over which density was calculated.
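The differenced density described above can be sketched as a ring (annulus) calculation: counts within the previous radius are subtracted from counts within the current radius before normalizing. The data layout, function names, and the convention of returning 0.0 for a ring with no addresses are illustrative assumptions; how the smallest radius is treated is not specified here, so the previous radius is passed explicitly.

```python
import math

def counts_within(target_idx, addresses, panels, radius_m):
    """(panels, addresses) within radius_m, excluding the target and its panels."""
    tx, ty = addresses[target_idx]

    def within(x, y):
        return math.hypot(x - tx, y - ty) <= radius_m

    n_addr = sum(1 for i, (x, y) in enumerate(addresses)
                 if i != target_idx and within(x, y))
    n_pan = sum(1 for x, y, owner in panels
                if owner != target_idx and within(x, y))
    return n_pan, n_addr

def differenced_density(target_idx, addresses, panels, prev_radius_m, radius_m):
    """Panels per address in the ring between prev_radius_m and radius_m."""
    p_out, a_out = counts_within(target_idx, addresses, panels, radius_m)
    p_in, a_in = counts_within(target_idx, addresses, panels, prev_radius_m)
    ring_panels, ring_addresses = p_out - p_in, a_out - a_in
    # Assumed convention: a ring with no addresses gets density 0.0.
    return ring_panels / ring_addresses if ring_addresses else 0.0

# Toy data (made up): positions in metres; panels carry their owner index.
addresses = [(0, 0), (100, 0), (250, 0), (0, 280), (450, 0)]
panels = [(0, 0, 0), (100, 0, 1), (250, 0, 2)]

ring_200_300 = differenced_density(0, addresses, panels, 200, 300)
ring_300_400 = differenced_density(0, addresses, panels, 300, 400)
print(ring_200_300, ring_300_400)
```

Because each ring contains only the newly added area, its density is not dominated by the nearby panels already counted at the shorter radius.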

We find that the density of panels around an address becomes a less important feature in the model as the radius over which we calculate this density grows (Fig. 3). The panel density variable with the smallest radius (200 m) has the largest contribution to model performance of all panel density features. The data follow an exponential decay with a decay radius of 210 m. This indicates that those buildings located closest to an address most strongly predict the likelihood of an address also having a solar panel, with this peer effect decreasing exponentially as distance from the address increases.
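One way to estimate a decay radius from importance scores is a log-linear least-squares fit, sketched below. The fitting procedure is not described in this section, so both the method and the functional form I(r) = A exp(-r / r0), with r0 read as the decay radius, are assumptions. The importance values are synthetic, generated from that form with r0 = 210 m purely to show that the fit recovers the parameter; they are not the reported results.

```python
import math

def fit_exponential_decay(radii_m, importances):
    """Fit I(r) = A * exp(-r / r0) by least squares on log(I).

    Returns (A, r0). Requires strictly positive importances.
    """
    logs = [math.log(v) for v in importances]
    n = len(radii_m)
    mean_r = sum(radii_m) / n
    mean_l = sum(logs) / n
    slope = (sum((r - mean_r) * (l - mean_l) for r, l in zip(radii_m, logs))
             / sum((r - mean_r) ** 2 for r in radii_m))
    if slope >= 0:
        raise ValueError("importances do not decay with radius")
    intercept = mean_l - slope * mean_r
    # log I = log A - r / r0, so r0 = -1 / slope and A = exp(intercept).
    return math.exp(intercept), -1.0 / slope

# Synthetic importances (not the reported values), one per radius.
radii_m = list(range(300, 1300, 100))
synthetic = [5.0 * math.exp(-r / 210.0) for r in radii_m]

amplitude, decay_radius_m = fit_exponential_decay(radii_m, synthetic)
print(round(amplitude, 3), round(decay_radius_m, 1))
```

Under this form, the importance falls by a factor of e (to roughly 37%) for every additional r0 metres of radius. A nonlinear least-squares fit on the untransformed scores would weight the large values more heavily and could give a somewhat different estimate on noisy data.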

Figure 3 Exponential decay of panel density importance with larger radii. Feature importances (presented multiplied by 100) are derived from 10 different models, each with the same set of socio-economic and demographic variables. Each model uses a different radius over which density is calculated, and only the increase in area relative to the next smaller radius is considered. The height of each bar indicates the improvement in model performance due to the inclusion of that density feature. The data follow an exponential curve with a decay radius of 210 m.

This relationship persists when the data are grouped into three bins either by tract area or by number of households (Supplementary Figures S11 and S13). Even though sub-selecting data reduces the clarity of the signal and therefore the performance of the classifier, we find that the exponential decay function with a decay radius of 210 m is still a very good fit for these sub-groups in both analyses (Supplementary Figures S12 and S14). This suggests that the decay effect we see is not driven by a correlation with tract size, whether measured by physical area or by number of households, but is rather a true effect of proximity.

As the radius over which density is calculated surpasses 500 m, the predictive power of the panel density feature stagnates and it ceases to be a highly important feature, as shown by the decrease in the feature importance score. P-values indicate that all panel density variables across all models remain significant at the 1% level (Supplementary Table S10). These results indicate that there may be a nonlinear relationship between proximity to solar panels and the likelihood of a panel being installed at a nearby address, which may be bounded at around 500 m.

Importance of solar panel density moderated by income

Figures 2 and 3 show that proximity to other solar panels is the most important predictor for having a solar panel installed and that this importance decreases the further away the other panels are. To test for potentially heterogeneous effects, i.e. whether this effect is more pronounced in some social groups than in others, we split the data according to several socio-economic categories. Binning our data into several income brackets reveals a nonlinear interaction between household income and the distance over which panel density is calculated (again using the differenced density calculation) (Fig. 4). Specifically, we define three income brackets based on the distribution of this feature (see Supplementary Figure S15). We define the lowest income bracket as less than $42,000/year, mid-income as between $42,000 and $80,000/year, and high income as more than $80,000/year. Figure 4 shows the permuted importance scores across binned household incomes for five radii (200 m, 300 m, 400 m, 500 m, and 1000 m), again calculated using the contribution to the area under the precision-recall curve. Across all income groups, the importance of panel density decreases with larger radii. However, the importance of high panel density at small radii (200 m to 500 m) is most pronounced in the low-income group. For larger radii (1000 m), this relation becomes insignificant. The extraction of a clear relation is further complicated by the fact that high-income groups might be more likely to live in less densely populated areas. These results suggest that the relationship between distance and likelihood of having a panel is moderated by income. Splitting by median home value shows comparable effects across different groups (see Supplementary Figure S16).
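The income split can be sketched with the thresholds given above. How values exactly equal to $42,000 or $80,000 are assigned is not stated, so placing both in the mid bracket is an assumption, as are the record layout and the field name `median_income`.

```python
from collections import defaultdict

LOW_MAX_USD = 42_000   # below this: low income
HIGH_MIN_USD = 80_000  # above this: high income

def income_bracket(income_usd):
    """Assign an annual household income to one of three brackets."""
    if income_usd < LOW_MAX_USD:
        return "low"
    if income_usd <= HIGH_MIN_USD:
        # Assumed convention: both boundary values fall in the mid bracket.
        return "mid"
    return "high"

def split_by_income(rows):
    """Group address records by income bracket (hypothetical record layout)."""
    groups = defaultdict(list)
    for row in rows:
        groups[income_bracket(row["median_income"])].append(row)
    return dict(groups)

# Toy records (made up): one entry per address.
rows = [
    {"address_id": 1, "median_income": 30_000},
    {"address_id": 2, "median_income": 42_000},
    {"address_id": 3, "median_income": 65_000},
    {"address_id": 4, "median_income": 95_000},
]

groups = split_by_income(rows)
print({name: len(members) for name, members in groups.items()})
```

Each group would then be used to train and evaluate its own set of models, one per radius, so that the importance of panel density can be compared across brackets.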