Solar and wind power data from the Chinese State Grid Renewable Energy Generation Forecasting Competition

Chen, Yongbao; Xu, Junjie

doi:10.1038/s41597-022-01696-6

Download PDF

Data Descriptor
Open access
Published: 21 September 2022

Solar and wind power data from the Chinese State Grid Renewable Energy Generation Forecasting Competition

Scientific Data volume 9, Article number: 577 (2022) Cite this article

8727 Accesses
19 Citations
Metrics details

Subjects

Abstract

Accurate solar and wind generation forecasting along with high renewable energy penetration in power grids throughout the world are crucial to the days-ahead power scheduling of energy systems. It is difficult to precisely forecast on-site power generation due to the intermittency and fluctuation characteristics of solar and wind energy. Solar and wind generation data from on-site sources are beneficial for the development of data-driven forecasting models. In this paper, an open dataset consisting of data collected from on-site renewable energy stations, including six wind farms and eight solar stations in China, is provided. Over two years (2019–2020), power generation and weather-related data were collected at 15-minute intervals. The dataset was used in the Renewable Energy Generation Forecasting Competition hosted by the Chinese State Grid in 2021. The process of data collection, data processing, and potential applications are described. The use of this dataset is promising for the development of data-driven forecasting models for renewable energy generation and the optimization of electricity demand response (DR) programs for the power grid.

Measurement(s)	renewable energy generation
Technology Type(s)	supervisory control and data acquisition system
Sample Characteristic - Location	China

Inherent spatiotemporal uncertainty of renewable power in China

Article Open access 04 September 2023

CarbonMonitor-Power near-real-time monitoring of global power generation on hourly to daily scales

Article Open access 17 April 2023

EWELD: A Large-Scale Industrial and Commercial Load Dataset in Extreme Weather Events

Article Open access 11 September 2023

Background & Summary

The usage of renewable energy is increasingly important to reduce carbon emissions and protect our environment. Currently, renewable energy penetration in the grid is increasing worldwide. The power supply must simultaneously match the demand; otherwise, power imbalance problems occur in the power grid. These problems hinder the continuous development of renewable energy¹, and overgeneration problems occur^2,3. As renewable energies such as solar energy and wind power are intermittent energy resources, it will be difficult for these energy sources to fully replace fossil energy in the foreseeable future. Energy storage and demand response (DR) are two promising technologies that can be utilized to alleviate power imbalance problems and provide more renewable energy in the power grid in the future⁴.

Despite implementing DR or designing an energy storage system, an accurate forecasting model for renewable energy generation is crucial to optimize the power system and allow more renewable energies to penetrate into the grid⁵. Without accurate and reliable forecasting of renewable energy generation, the maximum benefits from the energy management system cannot be realized. Usually, renewable energy generation forecasting can be categorized into four types based on the time horizon, i.e., very short term (less than 30 min), short term (30 min-6 h), medium term (6–24 h) and long term (1–7 d)⁶. However, unlike forecasting the electrical consumption of a building, which is generally regular, forecasting renewable energy generation is notoriously difficult due to energy generation variability, which, according to previous studies, is deeply influenced by meteorological conditions^7,8. Data-driven models such as machine learning algorithms have been well recognized in the field of big data science to deduct nonlinear relationships between independent and dependent variables⁹. Therefore, researchers have spent much effort on developing machine learning models. Machine learning algorithms such as generative adversarial networks (GANs), convolutional neural networks (CNNs), long short-term memory (LSTM) and ensemble methods are widely used^8,10. GANs have been considered the most efficient algorithm to capture the intermittency and fluctuation characteristics of wind and solar energy generation in recent years^11,12. GANs is a promising architecture in renewable scenarios generation, owing to the ability to avoid complex feature extraction and cumbersome manual labeling process that are required in the conventional data-driven model¹². Furthermore, GANs can effectively depict the inherent stochastic and dynamic characteristics of renewable resources with no need for statistical assumptions. All in all, GANs leverages the capabilities of deep learning and the power of data-driven techniques to address the difficulty of scenario generation.

The amount and quality of the dataset is the fundamental factor in the development of a data-driven forecasting model. Figure 1 shows the main diagram of developing a data-driven model for wind energy generation forecasting. Generally, there are two types of original datasets: simulated datasets and on-site collected datasets. The NREL Wind Integration Dataset is a widely used dataset¹³, and it provides simulated wind data from more than 126,000 land-based and offshore wind power production sites with a 2-km grid over the United States at a 5-min resolution. Datasets derived by analyzing satellite imagery are also common and effective. Through this method, a large-scale (i.e., city- or country-scale) dataset can be obtained. Simulated datasets are usually based on assumptions that are not always in accordance with real situations. On-site measurements are usually more accurate, and they are also more appropriate for the development of forecasting modes for a specific location. However, these data are difficult to collect. Agee et al. reported over six years of solar energy production data at a 1-hour resolution from a residential building (328 m²) in Virginia, USA¹⁴. Zhang et al. presented the global offshore wind turbine dataset¹⁵. There is a platform called OpenStreetMap that is used to recreate new versions of wind and solar installation datasets¹⁶. Solar radiation information is an indispensable parameter in analyzing solar generation. Jiang et al. presented a twelve-year (2007–2018) hourly dataset with 5-km resolution of surface and diffuse solar radiation in China¹⁷. Furthermore, more dataset repositories can be found in the review in⁸.

Although some solar and wind generation datasets have been made publicly available, few of them have focused on on-site wind farms and solar stations. Compared with simulated datasets, the on-site dataset is more meaningful for the development of a good generalization model. In developing a data-driven model to forecast renewable energy generation, feature variables such as wind speed and direction, solar irradiance and temperature are important variables used to train and validate the model. The motivation of this paper is to provide an on-site collected dataset for a better understanding of renewable energy generation characteristics, which are influenced by meteorological conditions and system parameters. Therefore, data-driven models can be developed using the dataset. This dataset was collected from six wind farms and eight solar stations in China. Based on this approach, solar and wind power forecasting models can be conveniently trained and validated.

Methods

Wind farms and solar stations are generally equipped with a supervisory control and data acquisition (SCADA) system that connects hardware and software for monitoring, controlling and analyzing processes such as data visualization, alarm function, fault detection and emergency offload. A detailed introduction of the SCADA system can be found in¹⁸. The data of these six selected wind farms and eight solar stations were collected using SCADA systems. The facilities’ basic information and the nominal output capacity are listed in Tables 1, 4. The sensor architecture of the monitor systems for wind farms and solar stations are presented in Fig. 2 and Fig. 3, respectively. Data were accessed through the remote monitor platform and downloaded as.xlsx files by the authorized owner. The nominal power output capacity of these selected wind farms ranged from 36 MW to 200 MW, and the capacity of these selected eight solar stations ranged from 30 MW to 130 MW.

Table 1 Basic information on the wind turbines of each wind farm, which includes the wind turbine model and number and detailed information.

Full size table

To cover different climate zones and geographic locations, the selected solar stations and wind farm sites included areas in North, Central, and Northwest China, and the terrain included deserts, mountains and plains. It should be noted that all the original datasets were obtained and provided by a third-party, the Chinese State Grid, and the data collection process was out of the authors’ control.

Data Records

In this section, the data types and the structure of the dataset, which can be downloaded from Figshare¹⁹ or GitHub (https://github.com/Bob05757/Renewable-energy-generation-input-feature-variables-analysis), are described. In the following subsections, the solar and wind data files are presented to guide users. There are two folders in the data repository; one is the folder that contains the original data with no data preprocessing, and the other folder contains data that was preprocessed based on the methods in The processing of the missing data and outliers subsection.

Wind power generation

Wind power generation data are in the wind_farms folder, which includes six Microsoft Excel files. The real-time power generation and weather conditions are recorded in these files. The basic information about each wind farm is listed in Table 1.

In each Excel file, two years (2019–2020) of data, which included on-site weather conditions and power generation, with a time granularity of 15 minutes were recorded. Table 2 describes the meaning of the column headings. The wind speed at different height levels was recorded, and the speed at the wheel hub of the wind turbine was the most important factor for predicting power generation.

Table 2 Description of the feature variables.

Full size table

The statistics of each wind farm can be seen in Table 3. The nominal wind generation capacity varied from 36 MW to 200 MW, and the average real output ranged from 6.7 MW to 72.7 MW. The wind speed at the height of the wheel hub varied from 0 m/s to 36.9 m/s, and the yearly average was approximately 6.0 m/s. The air temperature varied from −24.5 °C to 37.6 °C, and the yearly average was 8.5 °C. Weather conditions at different height levels showed a similar trend. Generally, the wind speed was seasonal, showing higher speeds during summertime and lower speeds during wintertime.

Table 3 Statistics of the wind farms.

Full size table

Solar energy generation

Solar power generation data are in the solar_stations folder, which includes eight Excel files. The weather condition data and real-time power generation data were recorded in these files. The power generation and PV panel information of each solar station are listed in Table 4. Similar to the wind generation dataset, two years (2019–2020) of data with a time granularity of 15 minutes were recorded. Table 2 describes the meaning of column headings. The nominal solar generation capacity varied from 30 MW to 130 MW, and the average real output ranged from 4.2 MW to 29.8 MW. The statistics of each solar station can be seen in Table 5.

Table 4 Power generation and PV panel information of each solar station, which includes the solar panel model and number and detailed information.

Full size table

Table 5 Statistics of solar stations.

Full size table

Technical Validation

In this section, the visualization of the data, which includes the processing of missing data, outliers, and correlation analysis of the influencing feature variables, is presented to clarify the data quality.

The processing of the missing data and outliers

The missing data include variables that were zero, null, ‘NA’, ‘0.001’, ‘−99’, and ‘–’. The outliers included weather variables that remained unchanged over a long time, atmosphere values that were equal to zero, and the values that were unreasonably high or low. Table 6 shows the rate of outliers and missing data in the original dataset.

Table 6 Missing data and outlier rate of the dataset.

Full size table

There are many different approaches to preprocessing data, and users can use any appropriate methods that they are familiar with or proficient in. We suggest an upward/downward completion or a linear interpolation approach for the data samples where small steps (e.g., less than 10 steps) are missing. A moving average method can be considered when intermittent time steps (e.g., less than 100 steps) are missing; however, for long-term (e.g., more than 100 steps) missing data cases, the removal of these samples is recommended. In addition, the on-site dataset should not be adopted if the missing data rate is larger than a specific rate (i.e., 20%) of the total dataset; for example, at solar station site 3, most of the total solar irradiance points were outliers after August 1^st, 2019. Figure 4 shows the boxplot of one key feature variable of wind and solar generation (missing data points were dropped before plotting the boxplot). The outliers can be seen in this figure. We provided both the original and processed dataset in the repository so that users can process the missing data and outliers using their own rules or use the processed dataset directly. It is worth noting that we only processed missing data such as ‘NA’, ‘0.001’, ‘-99’, and ‘--’ in the data files of data_processed folder, and the used approach was the simplest upward/downward completion. The outliers shown in Fig. 4 could be removed or not according to the data user themselves because these data points are classified as outliers by a specific criterion that the data is outside 1.5 times the interquartile range (IQR) including above the upper quartile (Q3 + 1.5*IQR) and below the lower quartile (Q1-1.5*IQR). Owing to the fluctuated characteristics of renewable energy, actually, some outliers in Fig. 4 could be a meaningful data point for developing a data-driven forecasting model.

Correlation analysis

In developing a data-driven forecasting model, selecting the proper input feature variables can improve the forecasting performance; therefore, correlation analysis is important for selecting the variables. Wind speed and solar radiation are the most important factors for generating wind and solar power, respectively. The Pearson correlation coefficient (PCC) is a measure of linear correlation between two sets of data. We found that the PCC between wind speed and power output in the wind dataset is much higher than other parameters, such as temperature and pressure (see Fig. 5). Similarly, in the solar dataset, total solar irradiance has the highest PCC with the power output, as shown in Fig. 6.

Usage Notes

The data preprocessing methods for the missing data and outliers impact the forecasting performance of machine learning models. The dataset was used for the Chinese State Grid Renewable Energy Generation Forecasting Competition. On-site weather conditions such as wind speed, wind direction, and solar radiation are the main input feature variables that influence the generation of power. For the wind generation power forecasting case, wind speed is the main factor. For the solar energy generation case, solar radiation variables are the main factors. Many machine learning algorithms, such as GANs, LightGBM, SVM, random forest, CNNs, and LSTM, can be developed using this dataset to predict wind and solar energy generation in the short term in the future (e.g., one day or one week). It is worth noting that forecasting weather data is required when the developed model is used to perform forecasting tasks.

The selection of the input feature variables is important for developing a model. Generally, more dimensions of input feature variables could improve the forecasting performance owing to more information being taken into consideration. However, some variables are highly correlated, such as wind speed, at different height levels.

In the process of training and validating our model, we found that the implementation of data classification technology can improve forecasting accuracy. As shown in Fig. 7, the wind speed and solar radiation change seasonally. Several classification methods are suggested, including seasonal classification, classification by wind speed, and classification by the intensity of solar radiation. When we make the classification, each classification label should have a similar sample size. Table 7 shows one of the classifications by wind speed examples in the case of forecasting wind power generation.

Table 7 An example of classification by wind speed.

Full size table

Another application of this dataset is the beneficial implementation of DR programs in the grid. For power grids, especially a distributed energy system, renewable energy is intermittent, so the demand side should be coordinately managed with power generation. With the forecasting of days-ahead renewable energy generation, energy management and control systems can be further optimized.

Code availability

All the code and processing scripts used to produce the results of this paper were written in Python, Jupyter lab. Links to scripts and data for analysis can be found in the GitHub repository (https://github.com/Bob05757/Renewable-energy-generation-input-feature-variables-analysis).

References

Chen, Y., Xu, P., Gu, J., Schmidt, F. & Li, W. Measures to improve energy demand flexibility in buildings for demand response (DR): A review. Energy Build. 177, 125–139 (2018).
Article Google Scholar
O’Shaughnessy, E., Cruce, J. R. & Xu, K. Too much of a good thing? Global trends in the curtailment of solar PV. Sol Energy 208, 1068–1077 (2020).
Article ADS Google Scholar
Qi, Y., Dong, W., Dong, C. & Huang, C. Understanding institutional barriers for wind curtailment in China. Renew. Sust. Energ. Rev. 105, 476–486 (2019).
Article Google Scholar
Chen, Y. et al. Experimental investigation of demand response potential of buildings: Combined passive thermal mass and active storage. Appl. Energy 280, 115956 (2020).
Article Google Scholar
Hamza Zafar, M. et al. Adaptive ML-based technique for renewable energy system power forecasting in hybrid PV-Wind farms power conversion systems. Energy Convers. Manag. 258, 115564 (2022).
Article Google Scholar
Ren, Y., Suganthan, P. N. & Srikanth, N. Ensemble methods for wind and solar power forecasting—A state-of-the-art review. Renew. Sust. Energ. Rev. 50, 82–91 (2015).
Article Google Scholar
Chen, Y., Wang, Y., Kirschen, D. & Zhang, B. Model-Free Renewable Scenario Generation Using Generative Adversarial Networks. IEEE Trans. Power Syst. 33, 3265–3275 (2018).
Article ADS Google Scholar
Aslam, S. et al. A survey on deep learning methods for power load and renewable energy forecasting in smart microgrids. Renew. Sust. Energ. Rev. 144, 110992 (2021).
Article Google Scholar
Ahmad, T., Zhang, D. & Huang, C. Methodological framework for short-and medium-term energy, solar and wind power forecasting with stochastic-based machine learning approach to monetary and energy policy applications. Energy 231, 120911 (2021).
Article Google Scholar
Suárez-Cetrulo, A. L., Burnham-King, L., Haughton, D. & Carbajo, R. S. Wind power forecasting using ensemble learning for day-ahead energy trading. Renew. Energy 191, 685–698 (2022).
Article Google Scholar
Zhang, Y., Ai, Q., Xiao, F., Hao, R. & Lu, T. Typical wind power scenario generation for multiple wind farms using conditional improved Wasserstein generative adversarial network. Int. J. Electr. Power Energy Syst 114, 105388 (2020).
Article Google Scholar
Yuan, R., Wang, B., Mao, Z. & Watada, J. Multi-objective wind power scenario forecasting based on PG-GAN. Energy 226, 120379 (2021).
Article Google Scholar
Draxl, C., Clifton, A., Hodge, B.-M. & McCaa, J. The Wind Integration National Dataset (WIND) Toolkit. Appl. Energy 151, 355–366 (2015).
Article Google Scholar
Agee, P., Nikdel, L. & Roberts, S. A measured energy use, solar production, and building air leakage dataset for a zero energy commercial building. Sci. Data 8, 299 (2021).
Article Google Scholar
Zhang, T., Tian, B., Sengupta, D., Zhang, L. & Si, Y. Global offshore wind turbine dataset. Sci. Data 8, 191 (2021).
Article Google Scholar
Dunnett, S., Sorichetta, A., Taylor, G. & Eigenbrod, F. Harmonised global datasets of wind and solar farm locations and power. Sci. Data 7, 130 (2020).
Article Google Scholar
Jiang, H., Lu, N., Qin, J. & Yao, L. Hourly 5-km surface total and diffuse solar radiation in China, 2007–2018. Sci. Data 7, 311 (2020).
Article Google Scholar
Kermani, M. et al. Intelligent energy management based on SCADA system in a real Microgrid for smart building applications. Renew. Energy 171, 1115–1127 (2021).
Article Google Scholar
Chen, Y. & Xu, J. Source code for: Solar and wind power data from the Chinese State Grid Renewable Energy Generation Forecasting Competition. Figshare https://doi.org/10.6084/m9.figshare.17304221.v4 (2022).

Download references

Acknowledgements

This work was funded by the China Postdoctoral Science Foundation (No. 2020M681347). We greatly appreciate the Chinese State Grid hosting the Renewable Energy Generation Forecasting Competition and sharing the dataset.

Author information

Authors and Affiliations

School of Energy and Power Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, China
Yongbao Chen & Junjie Xu
Shanghai Key Laboratory of Multiphase Flow and Heat Transfer in Power Engineering, Shanghai, 200093, China
Yongbao Chen

Authors

Yongbao Chen
View author publications
You can also search for this author in PubMed Google Scholar
Junjie Xu
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Yongbao Chen performed quality control on the dataset, wrote the paper, performed validation analyses, and edited the dataset. Junjie Xu double-checked the quality of the dataset and reviewed the manuscript.

Corresponding author

Correspondence to Yongbao Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Chen, Y., Xu, J. Solar and wind power data from the Chinese State Grid Renewable Energy Generation Forecasting Competition. Sci Data 9, 577 (2022). https://doi.org/10.1038/s41597-022-01696-6

Download citation

Received: 21 January 2022
Accepted: 26 August 2022
Published: 21 September 2022
DOI: https://doi.org/10.1038/s41597-022-01696-6