A crowdsourced dataset of aerial images with annotated solar photovoltaic arrays and installation metadata

Photovoltaic (PV) energy generation plays a crucial role in the energy transition. Small-scale, rooftop PV installations are deployed at an unprecedented pace, and their safe integration into the grid requires up-to-date, high-quality information. Overhead imagery is increasingly used to improve knowledge of rooftop PV installations through machine learning models capable of automatically mapping these installations. However, these models cannot be reliably transferred from one region or imagery source to another without incurring a decrease in accuracy. To address this issue, known as distribution shift, and to foster the development of PV array mapping pipelines, we propose a dataset containing aerial images, segmentation masks, and installation metadata (i.e., technical characteristics). We provide installation metadata for more than 28000 installations. We supply ground-truth segmentation masks for 13000 installations, including 7000 with annotations for two different image providers. Finally, we provide installation metadata matched to the annotations for more than 8000 installations. Dataset applications include end-to-end PV registry construction, robust PV installation mapping, and the analysis of crowdsourced datasets.


Background & Summary
In 2021, photovoltaic (PV) power generation amounted to 821 TWh worldwide and 14.3 TWh in France 1. With an installed capacity of about 633 GWp worldwide 2 and 13.66 GWp in France, PV energy represents a growing share of the energy supply. The integration of growing amounts of solar energy into energy systems requires an accurate estimation of the produced power to maintain a constant balance between demand and supply. However, small-scale PV installations are generally invisible to transmission system operators (TSOs), meaning their generated power is not monitored 3. For TSOs, the lack of reliable rooftop PV measurements increases flexibility needs, i.e., the ability of the grid to compensate for load or supply variability 4-7. Estimating PV power generation from meteorological data is common practice to overcome the lack of power measurements. However, this requires precise information on each installation, such as its installed capacity and other metadata 8,9. Detailed information on small-scale PV is also of interest for integrating renewable energies into the grid 10 and for understanding the factors behind its development 11.
Currently, PV installation registries covering large areas are neither easily available nor available everywhere. Recent research on constructing global PV inventories 12,13 is limited to solar farms and does not include rooftop PV. A recent crowdsourcing effort enabled the mapping of 86% of rooftop PV installations 14, but only in the United Kingdom. Other available datasets are aggregated at the communal scale (census level) 15.
Remote sensing-based methods 13,15-17 recently emerged as a promising solution to quickly and cheaply acquire detailed information on PV installations 18. These methods rely on overhead imagery and deep neural networks. The DeepSolar initiative led to the mapping of rooftop PV installations over the continental United States 15 and the state of North Rhine-Westphalia 19. However, these remote sensing-based methods cannot scale to unseen regions without a sharp decrease in accuracy 20,21. This decrease is caused by the sensitivity of deep learning models to distribution shifts 22 (i.e., when "the training distribution differs from the test distribution" 23). These distribution shifts typically correspond to differences in acquisition conditions and architecture across regions 24. The lack of robustness to distribution shifts limits the reliability of deep learning-based registries for constructing official PV statistics 10. Therefore, developing PV mapping algorithms that are robust to distribution shifts is necessary.
To encourage the development of such algorithms, we introduce a training dataset containing data for (i) addressing distribution shifts in remote sensing applications and (ii) helping design algorithms capable of extracting small-scale PV metadata from overhead imagery.
To address distribution shifts, we gathered ground-truth annotations from two image providers for installations located in France. The double annotation enables researchers to evaluate the robustness of their approach to a shift in data provider (which affects acquisition conditions, acquisition device, and ground sampling distance) while keeping the same observed object. Our dataset provides ground-truth installation masks for 13303 images from Google Earth 25 and 7686 images from the French national institute of geographical and forestry information (IGN). To address architectural differences, researchers can either use the coarse-grained location included in our dataset or combine our dataset with other training datasets covering different areas (e.g., Bradbury et al. 26 or Khomiakov et al. 27).
To support the extraction of PV systems' metadata, we release the installation metadata of 28807 installations. This metadata includes the installed capacity, surface, tilt, and azimuth angles, which are sufficient for regional PV power estimation 8. We linked the installation metadata and the ground-truth images for 8019 installations. To the best of our knowledge, this is the first training dataset that contains PV panel images, ground-truth labels, and installations' metadata. We hope this data contributes to the ongoing effort to construct more detailed PV registries.
We obtained our labels through two crowdsourcing campaigns conducted in 2021 and 2022. Crowdsourcing is common practice in the machine learning community for annotating training datasets 28,29. We developed our own crowdsourcing platform and collected up to 50 annotations per image to maximize the accuracy of our annotations. Besides, multiple annotations per image make it possible to measure the annotators' agreement and to limit their individual annotation biases. Indeed, we found that some annotators were more cautious when annotating than others. We make the raw crowdsourcing data publicly available. It enables the replication of our annotations, and we also hope it will support research on crowdsourcing, e.g., on the efficient combination of labels 30.
Our dataset targets practitioners and researchers in machine learning and crowdsourcing. Our data can serve as training data for remote PV mapping algorithms and to test machine learning models' robustness against shifts in acquisition conditions. Additionally, we release the raw annotation data from the crowdsourcing campaigns for the community to carry out further studies on the fusion of multiple annotations into ground-truth labels. The training dataset and the data from the crowdsourcing campaigns are accessible on our Zenodo repository 31.

Methods
We illustrate our training dataset generation workflow in Figure 1. It comprises three main steps: thumbnail extraction, annotation of solar arrays, and metadata matching.

Annotation of solar arrays
We extracted thumbnails based on the geolocation of the installations recorded in the BDPV dataset. However, this geolocation can be inaccurate, so before asking users to draw polygons of PV installations, we asked them to classify the images. This corresponds to the first phase of the annotation campaign. Once users had classified the images, we asked them to draw the PV polygons on the remaining images. This corresponds to the second phase of the crowdsourcing campaign. We designed our campaign to collect at least five annotations per image. This enabled us to derive metrics, which we term consensus metrics, targeted at maximizing the quality of our labels. In this way, we go further than the consensus between two annotators reported in previous work 26 to measure annotation quality. The analysis of the users' annotations during phases 1 and 2 is reproducible using the notebook annotations available on the public repository.

Phase 1: image classification
During the first phase, the user clicks on an image if it depicts a PV panel. We recorded the localization of the user's click and instructed them to click on the PV panel if there was one. We collected an average of 10 actions (a click with its localization, or no click) per image. The left panel of Figure 2 provides an example of annotations during phase 1. We apply the kernel density estimate (KDE) algorithm to the annotations to estimate a confidence level for the annotations and the approximate localization of the PV panel on the image. The likelihood f_σ(x_i) of the presence of a panel at each pixel x_i is given by

f_σ(x_i) = Σ_{k=1}^{N} K_σ(x_i - x_k),

where K_σ is a Gaussian kernel with standard deviation σ, x_k is the coordinate of the k-th annotation, and N is the total number of annotations.
After an empirical investigation, we calibrated the standard deviation of the kernel to reflect the approximate spatial extent of an array on the image. We set its value to 25 pixels for Google images and 12 pixels for IGN images; both correspond to a ground distance of about 2.5 m. As illustrated in Figure 2, the KDE yields a heatmap whose hotspot is located on the solar array. The maximum value of the KDE quantifies the confidence level of the annotation. We refer to it as the pixel annotation consensus (PAC). This metric is proportional to the number of annotations. We use the PAC to determine whether an image contains an array.
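As a minimal sketch, the click analysis described above can be reproduced with NumPy alone: a Gaussian kernel is summed over simulated click coordinates (the kernel's normalisation constant is omitted so a single click contributes a value of 1 at its centre; the coordinates below are fabricated for illustration).

```python
import numpy as np

def pac_heatmap(clicks, shape=(400, 400), sigma=25.0):
    """Sum of Gaussian kernels centred on each click annotation.

    clicks : list of (row, col) click coordinates
    sigma  : kernel std in pixels (25 for Google, 12 for IGN, i.e. ~2.5 m)
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    heat = np.zeros(shape)
    for r, c in clicks:
        heat += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
    return heat

# Three annotators clicked near the same array; one clicked elsewhere
clicks = [(100, 120), (102, 118), (98, 121), (300, 50)]
heat = pac_heatmap(clicks)
pac = heat.max()                                   # pixel annotation consensus
hotspot = np.unravel_index(heat.argmax(), heat.shape)  # estimated array centre
```

The hotspot lands on the three concordant clicks, while the isolated click barely raises the heatmap, which is how the PAC separates agreed-upon panels from stray annotations.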

Phase 2: polygon annotation
During the second phase, annotators delineate the PV panels on the images validated during phase 1. Users can draw as many polygons as they want. On average, we collected five polygons per image. We collect the coordinates of the polygons drawn by the annotators. As illustrated in the lower left panel of Figure 2, a set of polygons is available for each array in an image. We can see from the annotation illustrated in Figure 2 that some polygons may be erroneous. However, these false positives have fewer annotations than true positives. To select only the true positives, we compute the PAC through the following steps:
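A minimal sketch of such a polygon-consensus computation, assuming matplotlib's Path is used for rasterization (the polygons, the five-annotator setup, and the 0.45 threshold value discussed in the technical validation are illustrative):

```python
import numpy as np
from matplotlib.path import Path

def consensus_mask(annotations, shape=(400, 400), threshold=0.45):
    """Keep pixels covered by at least `threshold` of the annotators.

    annotations : one list of polygons per annotator, each polygon being
                  a list of (x, y) vertices in pixel coordinates
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pixels = np.column_stack([xs.ravel(), ys.ravel()])
    votes = np.zeros(shape)
    for polygons in annotations:
        covered = np.zeros(shape, dtype=bool)
        for poly in polygons:
            covered |= Path(poly).contains_points(pixels).reshape(shape)
        votes += covered            # one vote per annotator, not per polygon
    relative_pac = votes / len(annotations)
    return relative_pac >= threshold

# Four annotators agree on a square array; one also drew a spurious polygon
square = [(50, 50), (150, 50), (150, 150), (50, 150)]
spurious = [(300, 300), (340, 300), (340, 340), (300, 340)]
mask = consensus_mask([[square], [square], [square], [square, spurious]])
```

The spurious polygon is drawn by only one of four annotators (25% < 45%), so it is excluded from the final mask, mirroring how false positives with few annotations are filtered out.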

Metadata matching
Once we have generated our segmentation masks, we match them with the installations' metadata reported in the BDPV dataset. Our matching procedure follows three steps: internal consistency, unique matching, and external consistency. Note that we only apply these filters when matching the metadata and the masks.
Internal consistency ensures that the entries in the BDPV dataset are coherent before any matching. It is simply a cleaning of the raw dataset. To perform this cleaning, we verify whether the information in one column is coherent with the records from the other columns. For instance, if a system's record says it has ten modules and a surface of 3 square meters, this would mean that each PV module has a surface of 0.3 square meters, which is impossible (the smallest module size being 1.7 square meters).
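The internal-consistency check on module count versus surface can be sketched with pandas as follows (the column names and data frame contents are fabricated for illustration; only the 1.7 m² minimum module size comes from the example above):

```python
import pandas as pd

MIN_MODULE_SURFACE = 1.7  # m², smallest PV module size mentioned above

# Hypothetical extract of the raw BDPV metadata (illustrative column names)
df = pd.DataFrame({
    "id": [1, 2, 3],
    "modules": [10, 10, 8],
    "surface": [17.0, 3.0, 14.0],  # total array surface in m²
})

# Keep only records whose implied per-module surface is physically plausible
coherent = df[df["surface"] / df["modules"] >= MIN_MODULE_SURFACE]
```

Record 2 (3 m² for ten modules, i.e. 0.3 m² per module) is discarded, matching the impossible case described in the text.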
Unique matching ensures that each segmentation mask is associated with a single installation. Our segmentation masks may depict more than one array, for instance when more than one panel appears on the image shown to the annotators. In this case, we adopt a conservative view: if the segmentation mask depicts more than one panel, we cannot know which one corresponds to the installation reported in the BDPV dataset, so we do not match the segmentation mask with an installation.
External consistency is the final filtering step. After internal consistency filtering and unique matching, only segmentation masks depicting a single panel with coherent metadata remain. This last step ensures that the characteristics reported in the database match those that can be deduced from the segmentation mask. We assess the adequacy between the surface of the installation's mask and the true surface reported in the BDPV dataset by computing the ratio between them. We keep only installations whose ratio is equal to 1, with a tolerance bandwidth of ± 25%. We apply this bandwidth to accommodate possible approximations in the segmentation mask: the reported surface excludes the inter-panel space, and the panels' projection on images that are not perfectly orthorectified induces distortions.
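The surface-ratio filter can be sketched as below. The ground sampling distance used to convert a pixel count into square meters is an assumption for the example (roughly 0.1 m/pixel, consistent with the 25-pixel / 2.5 m kernel calibration for Google images); the pixel counts are fabricated.

```python
GSD = 0.1  # m/pixel, assumed value for the example (Google imagery)

def surface_ratio_ok(mask_surface, target_surface, tol=0.25):
    """Keep an installation when the mask/target surface ratio is 1 +/- 25%."""
    return abs(mask_surface / target_surface - 1.0) <= tol

mask_pixels = 1800                       # pixels labelled as PV in the mask
mask_surface = mask_pixels * GSD ** 2    # 18 m² estimated from the mask
```

For a reported surface of 19 m² the ratio is about 0.95 and the installation is kept; for 40 m² the ratio is 0.45 and the match is rejected.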

Data Records
The data records consist of two separate datasets, accessible on our Zenodo repository 31 at this URL: https://zenodo.org/record/7358126.

1. The training dataset: input images, segmentation masks, and PV installations' metadata;
2. The crowdsourcing and replication data: annotations from the users, for each image, provided in .json format, and the raw installations' metadata.
Besides, the source code and notebooks used to generate the masks from the users' annotations are accessible on our public Git repository at this URL: https://git.sophia.mines-paristech.fr/oie/bdappv. This repository contains the source code used to generate the segmentation masks. It also contains the notebooks annotations and metadata, which can be used to visualize the threshold analysis and the metadata matching procedure.

Training dataset
The training dataset, containing RGB images, ready-to-use segmentation masks from the two campaigns, and the file with the PV installations' metadata, is accessible on our Zenodo repository. It is organized as follows:
• bdappv/ Root data folder
  • google/ and ign/ One folder for each campaign;
  • metadata.csv The .csv file with the metadata of the installations. Table 7 describes the attributes of this table.

Crowdsourcing and replication data
The Git repository contains the raw crowdsourcing data and all the material necessary to re-generate our training dataset and technical validation. It is structured as follows: the raw subfolder contains the raw annotation data from the two annotation campaigns and the raw PV installations' metadata. The replication subfolder contains the compiled data used to generate our segmentation masks. The validation subfolder contains the compiled data necessary to replicate the analyses presented in the technical validation section.
• data/ Root data folder
  • raw/ Folder containing the raw crowdsourcing data and raw metadata:
    * input-google.json: Input data containing all information on images and raw annotators' contributions for both phases (clicks and polygons) during the first annotation campaign;
    * input-ign.json: Input data containing all information on images and raw annotators' contributions for both phases (clicks and polygons) during the second annotation campaign;
    * raw-metadata.csv: The file containing the PV systems' metadata extracted from the BDPV database before filtering. It can be used to replicate the association between the installations and the segmentation masks, as done in the notebook metadata. Table 6 describes the attributes of the raw-metadata.csv table.
  • replication/ Folder containing the compiled data used to generate the segmentation masks:
    * campaign-google/ and campaign-ign/ One folder for each campaign:
      • click-analysis.json: Output of the click analysis, compiling raw input into a few best-guess locations for the PV arrays. This dataset enables the replication of our annotations;
      • polygon-analysis.json: Output of the polygon analysis, compiling raw input into a best-guess polygon for the PV arrays.
  • validation/ Folder containing the compiled data used for technical validation:
    * campaign-google/ and campaign-ign/ One folder for each campaign:
      • click-analysis-thres=1.0.json: Output of the click analysis with a lowered threshold, used to analyze the effect of the threshold on image classification, as done in the notebook annotations;
      • polygon-analysis-thres=1.0.json: Output of the polygon analysis with a lowered threshold, used to analyze the effect of the threshold on polygon annotation, as done in the notebook annotations.
accurate masks and that its value should be 0.45. In other words, we consider that a pixel depicts an installation if at least 45% of the annotators included it in their polygons.
The center plot of Figure 3 depicts the histogram of the relative PAC. Visual inspection revealed that the few values below 0.45 corresponded to remaining false positives (e.g., roof windows). The use of a relative threshold is motivated by the fact that users can annotate as many polygons as they want. We enable replication of the threshold analysis in the notebook annotations.

Consistency between annotations and metadata of the PV installations
We link segmentation masks and installation metadata according to the steps described in the section "Metadata matching." To measure the quality of this linkage, we compute the Pearson correlation coefficient (PCC) between the surface reported in the installations' metadata (referred to as the "target" surface) and the surface estimated from the segmentation masks (referred to as the "estimated" surface). The higher the PCC, the better our matching procedure.
Figure 3 plots the estimated and target surfaces. After filtering, we obtain a PCC of 0.99 between the target and estimated surfaces. Without filtering, the PCC equals 0.68 for Google images and 0.61 for IGN images. This shows that our metadata-matching procedure enabled us to pick the installations with the best fit between the observable metadata and the masks.
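The validation metric itself is a standard Pearson correlation; a minimal sketch with NumPy on fabricated surfaces (the values below are toy data, not dataset records) shows how closely matched target and estimated surfaces yield a PCC near 1:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two samples."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Toy surfaces in m²: after filtering, estimated and target agree closely
target = [12.0, 20.0, 35.0, 8.0, 50.0]
estimated = [11.5, 21.0, 33.0, 8.5, 52.0]
pcc = pearson(target, estimated)
```

The same function applied to unfiltered, noisier pairs would produce the lower correlations reported above.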
Our matching procedure comprises three steps: internal consistency, unique matching, and external consistency. Each of these steps discards installations from the BDPV database. Table 2 summarizes the number of installations filtered at each step. Most of the filtering happens when we discard segmentation masks depicting more than one installation.

Usage Notes
We designed the complete dataset records to be directly used as training data in machine learning projects. The ready-to-use data is accessible on our Zenodo repository at this URL: https://zenodo.org/record/7358126. This repository also stores the raw crowdsourcing data and the files necessary to reproduce our segmentation masks and analyses. We compiled the files click-analysis.json and polygon-analysis.json from the raw input data using the Python scripts click-analysis.py and polygon-analysis.py, provided in our repository. This repository also contains the notebooks annotations and metadata. The notebook annotations presents the analysis of the data from the crowdsourcing campaigns. The notebook metadata filters the raw-metadata.csv datasheet.
Between phases 1 and 2, we generated new thumbnails re-centered on the PV installations. The new center corresponds to the estimated center of the first detected PV installation in the list. Therefore, to replicate the click analysis on the corresponding image, interested users need to download the corresponding image from the BDAPPV website, as illustrated in the notebook annotations. We re-center images by generating a new thumbnail centered on the updated location, following the procedure described in the section "Methods." This centering will not induce a bias during learning because our thumbnails are larger (400 × 400 pixels) than the typical input size of neural networks (224 × 224). Adding a random crop transform during training ensures that panels are no longer systematically centered. Besides, during the IGN campaign, we only re-centered about 13% of the images.
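Such a random crop from a 400 × 400 thumbnail to a 224 × 224 network input can be sketched with NumPy alone (framework-specific transforms such as those in torchvision would work equally well; the blank image below is a stand-in for a dataset thumbnail):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, size=224):
    """Extract a random size x size window from an (H, W, C) thumbnail."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

thumbnail = np.zeros((400, 400, 3), dtype=np.uint8)  # placeholder image
crop = random_crop(thumbnail)
```

Because the crop window moves by up to 176 pixels in each direction, a panel centered in the thumbnail can land anywhere in the cropped input, removing the systematic centering.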

Figure 1 .
Figure 1. Flowchart of the training dataset generation based on the BDPV PV data and crowdsourcing. "GSD" stands for ground sampling distance, i.e., the distance between the centers of two adjacent pixels measured on the ground.

Figure 2 .
Figure 2. Screenshot of the annotations notebook, showing the analysis of click annotations (phase 1, above) and polygon annotations (phase 2, below). During phase 1, each red dot corresponds to an annotation. The density of annotations is greatest near one of the panels, but other panels also received clicks.

Figure 3 .
Figure 3. Validation by comparison of the surface estimated from the masks and the surface reported in the PV installations' metadata.

Table 1 .
Summary statistics of the contributions during the crowdsourcing campaigns.

Table 2 .
Number of installations filtered through the different filtering steps during the association between the masks and the installations' metadata.