Global offshore wind turbine dataset

Offshore wind farms are widely adopted by coastal countries to obtain clean and green energy; their environmental impact has gained an increasing amount of attention. Although offshore wind farm datasets are commercially available via energy industries, records of the exact spatial distribution of individual wind turbines and their construction trajectories are rather incomplete, especially at the global level. Here, we construct a global remote sensing-based offshore wind turbine (OWT) database derived from Sentinel-1 synthetic aperture radar (SAR) time-series images from 2015 to 2019. We developed a percentile-based yearly SAR image collection reduction and autoadaptive threshold algorithm in the Google Earth Engine platform to identify the spatiotemporal distribution of global OWTs. By 2019, 6,924 wind turbines were constructed in 14 coastal nations. An algorithm performance analysis and validation were performed, and the extraction accuracies exceeded 99% using an independent validation dataset. This dataset could further our understanding of the environmental impact of OWTs and support effective marine spatial planning for sustainable development. Measurement(s) offshore wind turbine Technology Type(s) satellite imaging • digital curation Factor Type(s) temporal interval • spatial extent Sample Characteristic - Environment wind farm • atmospheric wind Sample Characteristic - Location global Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14865690

(EMODnet) wind farm database 19 and Open Power System Data (OPSD) renewable power plant database 13 (refer to details in Online-only Table 1). Among these regional/national databases, the USWTD and OPSD provide exact OWT locations. Although they lack global coverage and do not include the latest installations, parts of these data, such as the USA (USWTD) and Germany (OPSD) wind farm data, have been validated and can be used as references. Other databases, such as EMODnet, provide the number of turbines and the spatial boundaries or centroids of wind farms but lack precise turbine locations, while the UK REPD also suffers from location inaccuracy and provides only an approximate centroid for each offshore wind farm. Therefore, to date, no global OWT dataset with accurate geographic turbine locations is available in the public domain.
Satellite imagery is an important source of information for the identification of OWTs. However, widely utilized passive optical images (e.g., Landsat 30-m resolution images) are often affected by clouds and mist over coastal zones, which makes it difficult to map wind turbines 20. In contrast, SAR data from the Sentinel-1A/B satellites, the first of which was launched by the European Space Agency in 2014, can be collected regardless of cloud cover, day or night, and can be used to identify OWT objects, whose dihedral structures produce a drastic increase in backscattering 21.
In this study, we build a global OWT dataset by applying a percentile-based yearly SAR image collection reduction and autoadaptive threshold algorithm on the Google Earth Engine (GEE) platform using more than 737,100 Sentinel-1 SAR images. A method performance analysis, validation assessment and accuracy analysis were performed using Google high-resolution imagery, multisource optical and radar satellite image data (i.e., Landsat 8 OLI, Sentinel-2 MSI and Sentinel-1 data), an unmanned real-time kinematic (RTK) drone survey and other datasets (i.e., 4C Offshore, USWTD, UK REPD, OPSD, EMODnet). Figure 1 depicts the data acquisition and processing steps using a flow diagram. Unlike offshore wind farm datasets extracted from or validated against aerial imagery, our global OWT dataset does not underestimate the number of wind turbines, because the available Sentinel-1 data do not lag actual installations by several months. Therefore, this dataset can also be used to analyse regional variations in offshore wind farms (OWFs), prioritize OWF planning, and assess their potential environmental impacts. The global OWT dataset will be updated annually and is currently free to download via Figshare 22.

Methods
The global OWT dataset was developed on the GEE platform by applying geospatial and mathematical operations to Sentinel-1 SAR time-series imagery. These operations were performed to map the spatial distribution of individual OWTs in the global coastal zone.
Spatial extent. The spatial extent of OWTs covers the global offshore area in each exclusive economic zone (EEZ) 23. The EEZ database provides the maritime boundary prescribed by the 1982 United Nations Convention on the Law of the Sea over which a sovereign state has special rights regarding the exploration and use of marine resources. Based on this database, the extraction of OWTs was organized into 0.5° × 0.5° vector grids for the global coast. The main reasons for this step were to reduce the memory load on the remote servers of the GEE platform and to define a systematic geographic extent for this study.
SAR image processing. SAR images were collected and processed on the GEE platform. Imagery in the GEE 'COPERNICUS/S1_GRD' Sentinel-1 image collection consists of Level-1 Ground Range Detected (GRD) scenes, which are processed to the backscatter coefficient (σ°) in decibels (dB). Each scene in GEE was preprocessed with the Sentinel-1 Toolbox using the following steps: (1) application of an orbit file that updates the orbit metadata with a restituted orbit file; (2) removal of low-intensity GRD border noise and invalid data on scene edges; (3) thermal noise removal, which eliminates additive noise in subswaths to help reduce discontinuities between subswaths for scenes in multiswath acquisition modes; (4) radiometric calibration, which computes the backscatter intensity using the sensor calibration parameters in the GRD metadata; and (5) terrain correction using Shuttle Radar Topography Mission (SRTM) or Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) digital elevation model (DEM) products. This last step converts the data from ground range geometry to terrain-corrected backscatter values, which are then transformed to decibels through log scaling (10*log10(x)).
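The gridding and log-scaling steps described above can be illustrated with a minimal Python sketch. It is not the authors' GEE script: the function names `to_db` and `grid_cells` and the bounding box in the usage note are hypothetical, and the real workflow tiles the EEZ polygons rather than a simple rectangle.

```python
import math


def to_db(linear_backscatter):
    """Convert a linear backscatter value to decibels via 10*log10(x)."""
    if linear_backscatter <= 0:
        raise ValueError("backscatter must be positive to take a logarithm")
    return 10.0 * math.log10(linear_backscatter)


def grid_cells(west, south, east, north, step=0.5):
    """Yield (west, south, east, north) tuples that tile a bounding box.

    Cells are generated row by row, from south to north and west to east.
    The 0.5 degree default follows the grid size stated in the text.
    """
    n_cols = math.ceil((east - west) / step)
    n_rows = math.ceil((north - south) / step)
    for row in range(n_rows):
        for col in range(n_cols):
            w = west + col * step
            s = south + row * step
            yield (w, s, w + step, s + step)
```

For example, `list(grid_cells(120.0, 30.0, 121.0, 31.0))` splits a hypothetical 1° × 1° box into four 0.5° cells, and `to_db(1.0)` returns 0 dB, the level used later as the screening value for grids.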
In this study, Sentinel-1 imagery acquired in interferometric wide (IW) swath mode with vertical-vertical (VV) polarization was selected for the analyses. This configuration was selected because it is more effective than other configurations in detecting OWTs, as shown in Fig. 2. We selected three regions of interest for three types of objects in the offshore areas of the East China Sea and the North Sea: tidal flats, open water and OWTs. The histograms of the digital number (DN) values in the near-infrared band of the Sentinel-2 MultiSpectral Instrument (MSI) and of the backscattering coefficients in the Sentinel-1 VV and vertical-horizontal (VH) polarization bands were compared for these regions. The results showed that the backscattering coefficients in the Sentinel-1 VV band separate wind turbines from open water and tidal flats most clearly. Fig. 2 also shows that if the maximum backscatter coefficient in a particular grid is less than 0 dB, then this grid does not contain a wind farm. Therefore, we can directly exclude some grids from the analysis according to the following criterion (Eq. (1)):

$$BC_{\max} < 0\ \mathrm{dB} \;\Rightarrow\; \text{the grid is excluded} \qquad (1)$$

where $BC_{\max}$ is the maximum backscatter coefficient in the grid.

1) Removal of floating or temporarily mobile objects
Taking advantage of the Sentinel-1 time-series data, statistical analysis was applied to the composite images. After preprocessing the Sentinel-1 data and storing them as an 'ImageCollection', we filtered them by date range and spatial boundary to obtain an annual composite of 'VV' images for each selected grid. Percentile and mean values of a series are common statistical measures, and we applied them to identify floating or temporarily mobile objects based on how frequently they appear in the images. Floating or temporarily mobile objects, such as ships and vessels, were then removed by contrasting their mobility with the stability of fixed objects such as OWTs. Specifically, the mean of the values between the 80th and 100th percentiles of the series was computed using the 'intervalMean()' reduction method on the GEE platform.
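To make the effect of the 80-100% interval mean concrete, the following Python sketch applies a simplified, rank-based version of it to the yearly series of one pixel, together with the 0 dB grid screening. The function names are hypothetical and the percentile handling is an assumption; GEE's own reducer may interpolate percentiles differently.

```python
import math


def grid_may_contain_turbines(backscatter_db):
    """Grid screening: keep a grid only if its maximum reaches 0 dB."""
    return max(backscatter_db) >= 0.0


def interval_mean(values, lower_pct=80.0, upper_pct=100.0):
    """Mean of the values lying between two percentiles of a series.

    Simplified stand-in for an interval-mean reducer: the series is
    sorted and the slice between the two rank positions is averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("empty series")
    start = min(int(n * lower_pct / 100.0), n - 1)
    stop = min(n, max(int(math.ceil(n * upper_pct / 100.0)), start + 1))
    window = ordered[start:stop]
    return sum(window) / len(window)
```

With 30 hypothetical acquisitions, a pixel that is always bright (a turbine at 5 dB) keeps an interval mean of 5 dB, whereas a pixel that is bright only once (a passing ship over water at -20 dB) is pulled down to about -15.8 dB, because five of the six values in the top 20% are still water.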

2) Extraction of high-backscatter objects
Selection of an optimal threshold value is the most important step in object extraction. However, because the SAR backscatter coefficient of ocean water varies globally, it is necessary to apply an autoadaptive threshold to different ocean regions. The histogram for a grid without wind turbines generally has one peak in the lower values (water body) and no peak in the higher values (OWTs usually have values greater than 0 dB), which is reflected in the median of the lowest and highest values. We used a grid-based backscatter filter with an automatic adaptive threshold (T) to distinguish high-backscatter objects from the different ocean water backgrounds, which have low backscatter values. The threshold is defined as the median of the lowest and highest values, termed here the 'half min-max threshold'. We then obtained a binary image by comparing the backscatter coefficient of each pixel (Eq. (2)) with T (Eq. (3)):

$$\text{Binary image} = \begin{cases} 1, & BC > T \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

$$T = \frac{BC_{\max} + BC_{\min}}{2} \qquad (3)$$

where T is the dynamic threshold, BC is the backscatter coefficient of each pixel in the grid, $BC_{\max}$ is the backscatter maximum, and $BC_{\min}$ denotes the minimum value in the grid.
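As a worked illustration of the half min-max threshold, the sketch below computes T for one small grid and binarizes it. The function names and the sample values are hypothetical, and treating pixels exactly equal to T as background is an assumption.

```python
def half_min_max_threshold(grid_db):
    """T = (BC_max + BC_min) / 2 over all pixels of one grid."""
    flat = [value for row in grid_db for value in row]
    return (max(flat) + min(flat)) / 2.0


def binarize(grid_db):
    """Return 1 for pixels brighter than T and 0 for all other pixels."""
    threshold = half_min_max_threshold(grid_db)
    return [[1 if value > threshold else 0 for value in row]
            for row in grid_db]
```

For a 3 × 3 patch of water at about -20 dB with one 6 dB pixel in the centre, the minimum is -22 dB and the maximum is 6 dB, so T = -8 dB and only the centre pixel is kept. Because T follows the local minimum and maximum, the same rule adapts to brighter or darker water without a fixed global cutoff.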

3) Morphological operation
Because the binary images produced by the previous step are distorted by noise and textures, a morphological analysis was employed to enhance the high-backscatter image objects. Erosion and dilation can correct these distortions by accounting for the form and structure of the image. Both are nonlinear operations related to the shape or morphology of features in an image. The value of the output pixel for dilation is the maximum value of all the pixels in the neighbourhood, which makes objects more visible and fills in small holes in the objects. The value of the output pixel for erosion is the minimum value of all the pixels in the neighbourhood, which removes islands and small objects so that only substantive objects remain.
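The maximum and minimum rules for dilation and erosion can be sketched in a few lines of Python. The 3 × 3 square neighbourhood and the clipping of the neighbourhood at image borders are assumptions; the text does not state the kernel shape, kernel size or the order of operations that were used.

```python
def _neighbourhood(image, r, c):
    """Values in the 3x3 window around (r, c), clipped at the borders."""
    rows, cols = len(image), len(image[0])
    return [image[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]


def dilate(image):
    """Each output pixel is the maximum of its neighbourhood."""
    return [[max(_neighbourhood(image, r, c))
             for c in range(len(image[0]))]
            for r in range(len(image))]


def erode(image):
    """Each output pixel is the minimum of its neighbourhood."""
    return [[min(_neighbourhood(image, r, c))
             for c in range(len(image[0]))]
            for r in range(len(image))]
```

Dilating a single bright pixel grows it into a 3 × 3 block, and eroding that block shrinks it back to the single pixel, whereas eroding the isolated pixel directly removes it.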

4) Removal of large and minute objects
Knowing the number of pixels in an object helps to mask irrelevant objects of different sizes. An area-size-range filter (20 < number of pixels < 200) was used to eliminate objects that are too large or too small, such as islands, oil platforms and small noise objects. On the GEE platform, the 'connectedPixelCount()' method was used to compute the number of pixels in each object.
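The size filter can be sketched outside GEE with a breadth-first connected-component search. This is an illustrative stand-in for the pixel-count step, not the authors' code: the function name is hypothetical, and the use of 8-connectivity is an assumption because the text does not state the connectivity rule.

```python
from collections import deque


def filter_objects_by_size(binary, min_pixels=20, max_pixels=200):
    """Keep only objects with min_pixels < pixel count < max_pixels.

    Objects are groups of 8-connected pixels with value 1 (assumed).
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    kept = [[0] * cols for _ in range(rows)]
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Collect every pixel of the object that contains (r, c).
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                component.append((cr, cc))
                for dr, dc in offsets:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and binary[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Apply the strict size range from the text.
            if min_pixels < len(component) < max_pixels:
                for pr, pc in component:
                    kept[pr][pc] = 1
    return kept
```

In a test image containing a 25-pixel object, a single noise pixel and a 225-pixel object, only the 25-pixel object survives the filter.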

5) Post-processing of data records
We converted the raster data to vector data (using the 'image.reduceToVectors()' method on the GEE platform) and obtained the latitude and longitude coordinates of the individual wind turbines. Because the backscatter coefficient increases rapidly when an OWT is constructed, the installation dates of the wind turbine foundations can also be extracted from the yearly 'VV' images. Two statistic sequences were used for this purpose: $UF_k$ is computed along the forward time series (from January 2015 to December 2019), and $UB_k$ is the inverse value (from December 2019 to January 2015). When $UF_k$ and $UB_k$ intersect and the value at the intersection falls within the 95% confidence interval ($U_{0.05} = \pm 1.96$), the corresponding time of the intersection is considered the installation date of the wind turbine foundation. This operation was carried out in MATLAB.
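The forward and inverse statistics can be sketched as follows, under an explicit assumption: the text that defines the statistic is not present here, and the sketch takes UF and UB to be the sequential Mann-Kendall statistics, which is the usual meaning of this notation with a critical value of ±1.96. It is a Python illustration with hypothetical function names, not the authors' MATLAB script.

```python
import math


def sequential_statistic(series):
    """Forward statistic UF_k for k = 1..n (UF_1 is defined as 0)."""
    uf = [0.0]
    s = 0
    for k in range(2, len(series) + 1):
        x_k = series[k - 1]
        # Count earlier values that are smaller than the k-th value.
        s += sum(1 for j in range(k - 1) if x_k > series[j])
        mean = k * (k - 1) / 4.0
        variance = k * (k - 1) * (2 * k + 5) / 72.0
        uf.append((s - mean) / math.sqrt(variance))
    return uf


def forward_backward(series):
    """Return (UF, UB); UB is the negated statistic of the reversed series."""
    uf = sequential_statistic(series)
    ub = [-u for u in sequential_statistic(series[::-1])][::-1]
    return uf, ub


def crossing_indices(uf, ub, critical=1.96):
    """Indices where UF and UB cross while UF lies within +/- critical."""
    diff = [f - b for f, b in zip(uf, ub)]
    hits = []
    for i in range(len(diff) - 1):
        crosses = diff[i] == 0 or diff[i] * diff[i + 1] < 0
        if crosses and abs(uf[i]) <= critical:
            hits.append(i)
    return hits
```

Each returned index points into the input time series, so with a monthly backscatter series starting in January 2015, index i would correspond to the (i+1)-th month. Mapping an index to a calendar date is left to the caller because the sampling of the series is not specified in the text.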

Data Records
This dataset provides geocoded information about global OWTs from 2015 to 2019; it identifies 6,924 wind turbines in 14 coastal nations. Data are available at 10 m spatial resolution, providing an explicit dataset for planning, monitoring and managing marine space. The global OWT dataset is publicly available for download from Figshare 22 and can be visualized at https://arcg.is/0zu09X using an active ArcGIS online account.
The global OWT dataset is referenced to the WGS84 datum and stored in Shapefile (.shp) format. Each record consists of seven attributes: centroid latitude (centr_lat), centroid longitude (centr_lon), continent, country, sea area (sea_area), appearance year (occ_year) and month (occ_month). These attributes are described in Table 1.
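For readers who load the records into their own code, the seven attributes can be mirrored in a small Python record type. The field names follow the attribute list above; the Python types and the range checks are assumptions made for illustration and should be checked against Table 1 and the Shapefile itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurbineRecord:
    """One offshore wind turbine record (field types are assumed)."""
    centr_lat: float   # centroid latitude, WGS84 degrees
    centr_lon: float   # centroid longitude, WGS84 degrees
    continent: str
    country: str
    sea_area: str
    occ_year: int      # year in which the turbine first appears
    occ_month: int     # month in which the turbine first appears

    def __post_init__(self):
        if not -90.0 <= self.centr_lat <= 90.0:
            raise ValueError("centr_lat out of range")
        if not -180.0 <= self.centr_lon <= 180.0:
            raise ValueError("centr_lon out of range")
        if self.occ_year < 2015:
            raise ValueError("occ_year precedes the 2015 start of the series")
        if not 1 <= self.occ_month <= 12:
            raise ValueError("occ_month must be between 1 and 12")
```

A record could then be built as `TurbineRecord(32.5, 121.5, "Asia", "China", "Yellow Sea", 2017, 6)`; these values are made up for illustration and are not taken from the dataset.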

Technical Validation
Method performance assessment. OWT extraction is subject to uncertainties that arise from various background factors in the analysis grid, including tidal flats, turbidity of water bodies, and floating or temporarily mobile objects. Thus, to assess the performance of the extraction method and the stability of the OWT results output from GEE, we performed a sensitivity analysis of the wind turbine extraction against an increasing number of SAR images; the aim was to show that the amount of SAR image data that we utilized is sufficient to ensure stable extraction results under various background factors. We quantitatively evaluated the robustness of the extraction method by calculating the precision (P) (Eq. (4)), which is the number of real wind turbines extracted relative to the total number of wind turbines extracted; the recall (R) (Eq. (5)), which is the number of real wind turbines extracted relative to the total number of real wind turbines; and the comprehensive evaluation index (C) (Eq. (6)), which integrates the P and R values:

$$P = \frac{TP}{TP + FP} \qquad (4)$$

$$R = \frac{TP}{TP + FN} \qquad (5)$$

$$C = \frac{2 \times P \times R}{P + R} \qquad (6)$$

where TP is the number of accurately identified wind turbine objects, FP is the number of falsely identified wind turbine objects, and FN is the number of omitted wind turbine objects.
Using 1 to 40 images in the 2019 SAR image collection, Fig. 5 displays two examples of the change in extraction accuracy for turbid water and tidal flat backgrounds along the Shanghai coast (Shanghai Lingang Demonstration Wind Farm) and the Jiangsu coast (Jiangsu Rudong Offshore Intertidal Demonstration Wind Farm), China. The results reveal that the comprehensive evaluation index of the extracted wind turbines increased from 21.88% to 99.10% when 15 images were applied to the Shanghai Lingang Demonstration Wind Farm and increased from 83.78% to 99.04% when 20 images were applied to the Jiangsu Rudong Offshore Intertidal Demonstration Wind Farm.
Since the Sentinel-1 satellite has a 12-day or 6-day revisit cycle, our analysis results indicate that using an annual average backscattering coefficient (covering more than 20 images) for OWT extraction can ensure an extraction accuracy greater than 99% regardless of the background.
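The three evaluation measures can be computed directly from the TP, FP and FN counts, as in the short Python sketch below. The function names are hypothetical, the counts in the usage note are invented for illustration and are not results from this study, and returning 0 when a denominator is zero is a convention chosen here.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing was extracted."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no real turbines."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def comprehensive_index(p, r):
    """C = 2 * P * R / (P + R), the harmonic mean of P and R."""
    return 2.0 * p * r / (p + r) if (p + r) > 0 else 0.0
```

For instance, with 99 correct detections, 1 false detection and no omissions, P = 0.99, R = 1.0 and C ≈ 0.995.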
Accuracy assessment. Validation of the global dataset was conducted using an independent accuracy assessment approach. We generated a validation set of 50 randomly selected offshore wind farms covering 2,663 wind turbines, using three types of reference data: (1) high-resolution aerial imagery and Google images; (2) comparison and corroboration across multiple source datasets, including the OWTs in the 4C Offshore 17, USGS USWTD 15, UK REPD 16, EMODnet 19, OPSD 13 and GBWSF 14 databases; and (3) a comprehensive visual examination and an extensive internal review by the authors using true-colour composites of Sentinel-2 MSI or Landsat 8 OLI imagery and Sentinel-1 data after removal of floating or temporarily mobile objects. Verification with aerial imagery was conducted for October 2019. One offshore wind farm on the Jiangsu coast, China, covering 155 wind turbines, was validated with aerial photographs collected by a real-time kinematic (RTK) unmanned aerial vehicle (UAV). All the photographs carry geographic information, and Fig. 6 shows the specific locations of two wind turbines in that large wind farm. Furthermore, six other wind farms in China were cross-validated with Sentinel-2 MSI data, Landsat 8 OLI imagery and Google high-resolution imagery in Google Earth (Fig. 7). In addition, 43 wind farms in six countries in North America and Europe were selected, referenced and cross-validated using different national/international datasets. All the wind turbines in the validation dataset (covering 50 wind farms) were double-checked by visual inspection using the Sentinel-1 data. Specifically, two authors with sufficient backgrounds in remote sensing and GIS separately obtained these source images country by country from the GEE platform and cross-validated the positions and numbers of OWTs.
The validation dataset is also publicly available for download from Figshare 22. Three methods were used to generate the independent validation dataset because no consistent set of global validation data for OWTs is available for accuracy assessment. To report the precision metric, we calculated the ratio of accurately identified wind turbine objects to all detected objects in our dataset. … Table 1). The identification errors are attributed to met masts and offshore substations located inside or near the wind farms, which are extracted together with the turbines, as for the EnBW Baltic 2 and Arkona OWFs in Germany. To calculate recall, we subtracted the falsely detected objects from all detected objects and divided the result by the number of all instances in the validation dataset. As expected, all recall values reach 100%, meaning that no wind turbines were omitted in the validated areas (Online-only Table 1).
Therefore, our validation shows that studies using this OWT dataset need to consider the purpose that these data serve. If the met masts and offshore substations near the wind farms do not matter, then this dataset has an acceptable accuracy. Compared to other databases that provide only approximate spatial locations and turbine numbers 16,19 or incomplete, inaccurate information 14, our dataset has a high spatiotemporal resolution. A visual comparison (Fig. 8) with the GIS OWT data in the GBWSF, USWTD, EMODnet, and OPSD datasets also confirms that our dataset has good coverage and high location accuracy and can complement other databases as a consistent, highly credible OWT dataset with full global coverage.

Usage Notes
The dataset derived from satellite imagery provides the spatiotemporal distribution of global OWTs from 2015 to 2019. This dataset has the potential to further elucidate the impact of OWTs on coastal ecosystems, support biodiversity conservation and environmental impact assessments, and help generate sustainable development strategies for offshore wind energy.
We take no responsibility for any third-party use or analysis of the data, nor do we endorse any third-party opinions or conclusions reached using these data. We also ask that users notify the authors of any errors or omissions identified in the data so that they can be corrected.

Code Availability
All the code and processing scripts used to produce the results of this paper were written for GEE and MATLAB. Links to the scripts and data for the analyses can be found in the GitHub repository at https://github.com/tzhang-edu/GOWT.