Background & Summary

The usage of renewable energy is increasingly important to reduce carbon emissions and protect our environment. Currently, renewable energy penetration in the grid is increasing worldwide. The power supply must simultaneously match the demand; otherwise, power imbalance problems occur in the power grid. These problems hinder the continuous development of renewable energy1, and overgeneration problems occur2,3. As renewable energies such as solar energy and wind power are intermittent energy resources, it will be difficult for these energy sources to fully replace fossil energy in the foreseeable future. Energy storage and demand response (DR) are two promising technologies that can be utilized to alleviate power imbalance problems and provide more renewable energy in the power grid in the future4.

Despite implementing DR or designing an energy storage system, an accurate forecasting model for renewable energy generation is crucial to optimize the power system and allow more renewable energies to penetrate into the grid5. Without accurate and reliable forecasting of renewable energy generation, the maximum benefits from the energy management system cannot be realized. Usually, renewable energy generation forecasting can be categorized into four types based on the time horizon, i.e., very short term (less than 30 min), short term (30 min-6 h), medium term (6–24 h) and long term (1–7 d)6. However, unlike forecasting the electrical consumption of a building, which is generally regular, forecasting renewable energy generation is notoriously difficult due to energy generation variability, which, according to previous studies, is deeply influenced by meteorological conditions7,8. Data-driven models such as machine learning algorithms have been well recognized in the field of big data science to deduct nonlinear relationships between independent and dependent variables9. Therefore, researchers have spent much effort on developing machine learning models. Machine learning algorithms such as generative adversarial networks (GANs), convolutional neural networks (CNNs), long short-term memory (LSTM) and ensemble methods are widely used8,10. GANs have been considered the most efficient algorithm to capture the intermittency and fluctuation characteristics of wind and solar energy generation in recent years11,12. GANs is a promising architecture in renewable scenarios generation, owing to the ability to avoid complex feature extraction and cumbersome manual labeling process that are required in the conventional data-driven model12. Furthermore, GANs can effectively depict the inherent stochastic and dynamic characteristics of renewable resources with no need for statistical assumptions. All in all, GANs leverages the capabilities of deep learning and the power of data-driven techniques to address the difficulty of scenario generation.

The amount and quality of the dataset is the fundamental factor in the development of a data-driven forecasting model. Figure 1 shows the main diagram of developing a data-driven model for wind energy generation forecasting. Generally, there are two types of original datasets: simulated datasets and on-site collected datasets. The NREL Wind Integration Dataset is a widely used dataset13, and it provides simulated wind data from more than 126,000 land-based and offshore wind power production sites with a 2-km grid over the United States at a 5-min resolution. Datasets derived by analyzing satellite imagery are also common and effective. Through this method, a large-scale (i.e., city- or country-scale) dataset can be obtained. Simulated datasets are usually based on assumptions that are not always in accordance with real situations. On-site measurements are usually more accurate, and they are also more appropriate for the development of forecasting modes for a specific location. However, these data are difficult to collect. Agee et al. reported over six years of solar energy production data at a 1-hour resolution from a residential building (328 m2) in Virginia, USA14. Zhang et al. presented the global offshore wind turbine dataset15. There is a platform called OpenStreetMap that is used to recreate new versions of wind and solar installation datasets16. Solar radiation information is an indispensable parameter in analyzing solar generation. Jiang et al. presented a twelve-year (2007–2018) hourly dataset with 5-km resolution of surface and diffuse solar radiation in China17. Furthermore, more dataset repositories can be found in the review in8.

Fig. 1
figure 1

Flow diagram of data-driven model development process for wind energy forecasting.

Although some solar and wind generation datasets have been made publicly available, few of them have focused on on-site wind farms and solar stations. Compared with simulated datasets, the on-site dataset is more meaningful for the development of a good generalization model. In developing a data-driven model to forecast renewable energy generation, feature variables such as wind speed and direction, solar irradiance and temperature are important variables used to train and validate the model. The motivation of this paper is to provide an on-site collected dataset for a better understanding of renewable energy generation characteristics, which are influenced by meteorological conditions and system parameters. Therefore, data-driven models can be developed using the dataset. This dataset was collected from six wind farms and eight solar stations in China. Based on this approach, solar and wind power forecasting models can be conveniently trained and validated.

Methods

Wind farms and solar stations are generally equipped with a supervisory control and data acquisition (SCADA) system that connects hardware and software for monitoring, controlling and analyzing processes such as data visualization, alarm function, fault detection and emergency offload. A detailed introduction of the SCADA system can be found in18. The data of these six selected wind farms and eight solar stations were collected using SCADA systems. The facilities’ basic information and the nominal output capacity are listed in Tables 1, 4. The sensor architecture of the monitor systems for wind farms and solar stations are presented in Fig. 2 and Fig. 3, respectively. Data were accessed through the remote monitor platform and downloaded as.xlsx files by the authorized owner. The nominal power output capacity of these selected wind farms ranged from 36 MW to 200 MW, and the capacity of these selected eight solar stations ranged from 30 MW to 130 MW.

Table 1 Basic information on the wind turbines of each wind farm, which includes the wind turbine model and number and detailed information.
Fig. 2
figure 2

Sensor architecture and data collection process of the wind farms.

Fig. 3
figure 3

Sensor architecture and data collection process of the solar stations.

To cover different climate zones and geographic locations, the selected solar stations and wind farm sites included areas in North, Central, and Northwest China, and the terrain included deserts, mountains and plains. It should be noted that all the original datasets were obtained and provided by a third-party, the Chinese State Grid, and the data collection process was out of the authors’ control.

Data Records

In this section, the data types and the structure of the dataset, which can be downloaded from Figshare19 or GitHub (https://github.com/Bob05757/Renewable-energy-generation-input-feature-variables-analysis), are described. In the following subsections, the solar and wind data files are presented to guide users. There are two folders in the data repository; one is the folder that contains the original data with no data preprocessing, and the other folder contains data that was preprocessed based on the methods in The processing of the missing data and outliers subsection.

Wind power generation

Wind power generation data are in the wind_farms folder, which includes six Microsoft Excel files. The real-time power generation and weather conditions are recorded in these files. The basic information about each wind farm is listed in Table 1.

In each Excel file, two years (2019–2020) of data, which included on-site weather conditions and power generation, with a time granularity of 15 minutes were recorded. Table 2 describes the meaning of the column headings. The wind speed at different height levels was recorded, and the speed at the wheel hub of the wind turbine was the most important factor for predicting power generation.

Table 2 Description of the feature variables.

The statistics of each wind farm can be seen in Table 3. The nominal wind generation capacity varied from 36 MW to 200 MW, and the average real output ranged from 6.7 MW to 72.7 MW. The wind speed at the height of the wheel hub varied from 0 m/s to 36.9 m/s, and the yearly average was approximately 6.0 m/s. The air temperature varied from −24.5 °C to 37.6 °C, and the yearly average was 8.5 °C. Weather conditions at different height levels showed a similar trend. Generally, the wind speed was seasonal, showing higher speeds during summertime and lower speeds during wintertime.

Table 3 Statistics of the wind farms.

Solar energy generation

Solar power generation data are in the solar_stations folder, which includes eight Excel files. The weather condition data and real-time power generation data were recorded in these files. The power generation and PV panel information of each solar station are listed in Table 4. Similar to the wind generation dataset, two years (2019–2020) of data with a time granularity of 15 minutes were recorded. Table 2 describes the meaning of column headings. The nominal solar generation capacity varied from 30 MW to 130 MW, and the average real output ranged from 4.2 MW to 29.8 MW. The statistics of each solar station can be seen in Table 5.

Table 4 Power generation and PV panel information of each solar station, which includes the solar panel model and number and detailed information.
Table 5 Statistics of solar stations.

Technical Validation

In this section, the visualization of the data, which includes the processing of missing data, outliers, and correlation analysis of the influencing feature variables, is presented to clarify the data quality.

The processing of the missing data and outliers

The missing data include variables that were zero, null, ‘NA’, ‘0.001’, ‘−99’, and ‘–’. The outliers included weather variables that remained unchanged over a long time, atmosphere values that were equal to zero, and the values that were unreasonably high or low. Table 6 shows the rate of outliers and missing data in the original dataset.

Table 6 Missing data and outlier rate of the dataset.

There are many different approaches to preprocessing data, and users can use any appropriate methods that they are familiar with or proficient in. We suggest an upward/downward completion or a linear interpolation approach for the data samples where small steps (e.g., less than 10 steps) are missing. A moving average method can be considered when intermittent time steps (e.g., less than 100 steps) are missing; however, for long-term (e.g., more than 100 steps) missing data cases, the removal of these samples is recommended. In addition, the on-site dataset should not be adopted if the missing data rate is larger than a specific rate (i.e., 20%) of the total dataset; for example, at solar station site 3, most of the total solar irradiance points were outliers after August 1st, 2019. Figure 4 shows the boxplot of one key feature variable of wind and solar generation (missing data points were dropped before plotting the boxplot). The outliers can be seen in this figure. We provided both the original and processed dataset in the repository so that users can process the missing data and outliers using their own rules or use the processed dataset directly. It is worth noting that we only processed missing data such as ‘NA’, ‘0.001’, ‘-99’, and ‘--’ in the data files of data_processed folder, and the used approach was the simplest upward/downward completion. The outliers shown in Fig. 4 could be removed or not according to the data user themselves because these data points are classified as outliers by a specific criterion that the data is outside 1.5 times the interquartile range (IQR) including above the upper quartile (Q3 + 1.5*IQR) and below the lower quartile (Q1-1.5*IQR). Owing to the fluctuated characteristics of renewable energy, actually, some outliers in Fig. 4 could be a meaningful data point for developing a data-driven forecasting model.

Fig. 4
figure 4

Boxplots of the key features of wind farms and solar stations. Before plotting these boxplots, the missing data, such as ‘-99’ and ‘null’, were dropped. Although there are several feature variables in the dataset, we selected the most important one to show the quartiles and outliers. In subplot (a), the wind speed at hub height is presented, and in subplot (b), the total solar irradiance is presented. The Jupyter notebook on the data processing and visualization can be found in the GitHub repository (https://github.com/Bob05757/Renewable-energy-generation-input-feature-variables-analysis).

Correlation analysis

In developing a data-driven forecasting model, selecting the proper input feature variables can improve the forecasting performance; therefore, correlation analysis is important for selecting the variables. Wind speed and solar radiation are the most important factors for generating wind and solar power, respectively. The Pearson correlation coefficient (PCC) is a measure of linear correlation between two sets of data. We found that the PCC between wind speed and power output in the wind dataset is much higher than other parameters, such as temperature and pressure (see Fig. 5). Similarly, in the solar dataset, total solar irradiance has the highest PCC with the power output, as shown in Fig. 6.

Fig. 5
figure 5

Pearson correlation coefficient of different variables of the wind farms. WS_x (i.e., wind speed at different heights) has the highest PCC with respect to power. The hub height is different for each model of the wind turbine, so WS_cen represents different heights. The hub heights are 85 m, 120 m, 80 m, 85 m/90 m, 90 m, and 65 m for wind farm sites 1, 2, 3, 4, 5, and 6, respectively.

Fig. 6
figure 6

Pearson correlation coefficient of different variables of the solar stations. Generally, TSI has the highest PCC with respect to power.

Usage Notes

The data preprocessing methods for the missing data and outliers impact the forecasting performance of machine learning models. The dataset was used for the Chinese State Grid Renewable Energy Generation Forecasting Competition. On-site weather conditions such as wind speed, wind direction, and solar radiation are the main input feature variables that influence the generation of power. For the wind generation power forecasting case, wind speed is the main factor. For the solar energy generation case, solar radiation variables are the main factors. Many machine learning algorithms, such as GANs, LightGBM, SVM, random forest, CNNs, and LSTM, can be developed using this dataset to predict wind and solar energy generation in the short term in the future (e.g., one day or one week). It is worth noting that forecasting weather data is required when the developed model is used to perform forecasting tasks.

The selection of the input feature variables is important for developing a model. Generally, more dimensions of input feature variables could improve the forecasting performance owing to more information being taken into consideration. However, some variables are highly correlated, such as wind speed, at different height levels.

In the process of training and validating our model, we found that the implementation of data classification technology can improve forecasting accuracy. As shown in Fig. 7, the wind speed and solar radiation change seasonally. Several classification methods are suggested, including seasonal classification, classification by wind speed, and classification by the intensity of solar radiation. When we make the classification, each classification label should have a similar sample size. Table 7 shows one of the classifications by wind speed examples in the case of forecasting wind power generation.

Fig. 7
figure 7

Seasonal trends of the main feature variable.

Table 7 An example of classification by wind speed.

Another application of this dataset is the beneficial implementation of DR programs in the grid. For power grids, especially a distributed energy system, renewable energy is intermittent, so the demand side should be coordinately managed with power generation. With the forecasting of days-ahead renewable energy generation, energy management and control systems can be further optimized.