Introduction

Fengyun-3 (FY-3) is China's second-generation series of polar-orbiting meteorological satellites. It provides global, all-weather, three-dimensional, quantitative, multi-spectral observations of atmospheric, land surface and sea surface parameters. FY-3A and FY-3B both carry the Visible and Infrared Radiometer (VIRR). VIRR has 10 spectral bands (0.43–12.50 μm) and a spatial resolution of 1.1 km. It is mainly used to monitor global cloud cover, ocean surface temperature, and vegetation growth status and types1. Terra and Aqua are satellites of the United States (US) Earth Observing System (EOS) program. Both carry the Moderate Resolution Imaging Spectroradiometer (MODIS). MODIS has 36 bands (0.405–14.385 μm): bands 1–2 have a spatial resolution of 250 m, bands 3–7 of 500 m, and bands 8–36 of 1000 m, and the scan width is 2330 km. It is used for long-term global observation of the earth's surface, biosphere, solid earth, atmosphere and ocean. MODIS is stable and well calibrated, and has accumulated many years of data since its launch2,3,4. Comparing Terra/MODIS and Aqua/MODIS data with FY-3/VIRR data therefore allows a preliminary evaluation of FY-3 performance, and the MODIS data can to a certain extent supplement and correct the FY-3 data. The Advanced Very High Resolution Radiometer (AVHRR) is the main instrument on the National Oceanic and Atmospheric Administration (NOAA) series of satellites. It is a scanning radiometer with five spectral channels. The scanning angle of the on-board detector is ± 55.4°, which corresponds to a swath about 2800 km wide on the ground. The sub-satellite point resolution of AVHRR is 1.1 km. Two types of global AVHRR data are currently available: NOAA Global Area Coverage (GAC) data and NOAA Global Vegetation Index (GVI) data5. GVI is composed, pixel by pixel, of the largest normalized difference vegetation index (NDVI) value found in the images of 7 consecutive days. AVHRR has been producing global vegetation data since 19826,7,8.

Vegetation plays an extremely important role in the earth's ecosystem. It affects the climate, hydrology and biochemical environment while being constrained by these same factors. Vegetation is therefore an important indicator of the impact of climatic, human and biochemical factors on the environment9,10,11,12. NDVI is a commonly used vegetation index. Because it is simple to calculate, has wide spatial coverage, and is highly sensitive, it is widely used in vegetation monitoring and is one of the important parameters describing surface vegetation13,14. Many scholars have carried out cross-comparisons between different sensors15,16. For example, Duarte et al.12 derived phenological indexes from NDVI with an open source tool developed for QGIS (a free and open source geographic information system). In that work, the toolbar was applied to MODIS data covering a particular region of Portugal, but it can be applied to other satellite data and study areas. The code is open and can be modified according to user requirements, and publishing the plug-in and application code also allows other users to improve the application. Feng Rui et al.17 analyzed the differences between the NDVI of FY/MERSI (Medium Resolution Spectral Imager) and EOS/MODIS and found that the retrieval results for various ground features showed good linear consistency; Yuan Zhengwu et al.18 established a quantitative relationship between the vegetation indices of Landsat TM (Thematic Mapper) and HJCCD (Environmental satellite CCD camera), which provides a basis for the combined use of Landsat TM and HJCCD data; Xu Hanqiu et al.19 analyzed the characteristics of the red and near-infrared bands of ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) and Landsat ETM + (Enhanced Thematic Mapper) by comparing their vegetation indices. 
Wu Wenbin et al.20 compared the Savitzky-Golay filter and asymmetric Gaussian function fitting for NDVI time series data; Li Jing et al.21 used NDVI data of the Southwest Virginia coal field in the United States from 1984 to 2010 to compare the three filtering algorithms for long-term remote sensing vegetation index data sets in TIMESAT3.1 (Time-series Satellite data Analysis Tool 3.1); Sha Sha et al.22 took Maqu as an example to compare and analyze three long-term NDVI series, NDVI/MODIS, NDVI/GIMMS (Global Inventory Modeling and Mapping Studies) and NDVI/NSMC (National Satellite Meteorological Center). A vegetation index with a long time series is a simple and effective parameter for dynamic monitoring, and is very important for monitoring surface vegetation, ecological improvement, ecological evaluation and so on12,17,18,19,20,21,22.
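For reference, the index compared throughout this paper can be written out explicitly. The expression below is the standard definition of NDVI, not a formula taken from this paper; the symbols ρNIR and ρRed (near-infrared and red reflectance) are notation introduced here for illustration.

$${\text{NDVI}} = \frac{{\rho_{{{\text{NIR}}}} - \rho_{{{\text{Red}}}} }}{{\rho_{{{\text{NIR}}}} + \rho_{{{\text{Red}}}} }}$$

Because the index depends only on the red and near-infrared reflectances, differences between sensors in the spectral response of these two channels carry over directly into differences in NDVI.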

Machine learning is an important method in artificial intelligence. It can solve complex problems through algorithms and has become one of the most popular research subjects23,24,25,26. The application of machine learning in remote sensing is generally divided into the following steps27,28: collecting and cleaning data, building the model, selecting the appropriate algorithm, obtaining reliable results, and visualizing the data. In remote sensing, data are mainly collected by satellites or drones29. Data cleaning is needed when the data set is incomplete or has missing values, and the choice of algorithm requires an understanding of the problem to be solved. If the model is only used for forecasting and highly reliable results are not required, the workflow ends here. However, if the work is intended for a research paper or highly credible results are required, a graphics library is needed to plot the results so that they can be interpreted from the charts30.

A training sample set was built for a linear regression algorithm by combining the normalized vegetation index products retrieved from the Fengyun satellite and MODIS with observation parameters, surface type, ground elevation and meteorological factors. Using the machine learning model and these related factors and parameters, the NDVI product retrieved from the Fengyun satellite is corrected to be basically consistent with the MODIS NDVI product, and a long-term normalized vegetation index is obtained.

Because of differences in sensitivity, resolution and observation method, different instruments give somewhat different NDVI values. This study therefore compares the NDVI of MODIS with the NDVI of VIRR and of AVHRR. We use statistical methods to compare and analyze the three normalized vegetation indices and to determine the differences and correlations between the NDVI of VIRR, AVHRR and MODIS. Based on a machine learning algorithm, we then construct an NDVI correction algorithm for the Fengyun satellite in order to form a long time series of the vegetation index.

Data and methods

Data

This paper selects parts of China and surrounding areas as the study area. The data are the NDVI products of the MODIS sensors on Terra and Aqua (NDVIm), of the AVHRR sensors on the NOAA satellites (NDVIa), and of the VIRR sensors on the Fengyun satellites (NDVIv)31. The analysis has three steps: (I) compare NDVIv with NDVIa, and NDVIa with NDVIm; (II) from these comparisons, find the functional relationship between NDVIa and NDVIm and the functional relationship between NDVIv and NDVIa; (III) use NDVIa as the intermediary to correct the NDVIv data to a level equivalent to NDVIm.

The data used in this study (see Table 1) are NDVIa from 1982 to 2015, NDVIm from 2000 to 2019, and NDVIv from 2015 to 2020, all at a resolution of 0.05°. Both NDVIa and NDVIm are available for 2005, so we use the data of that year to compare NDVIa with NDVIm and explore the correlation between the two. Both NDVIv and NDVIa are available for 2015, so we use the data of that year to compare NDVIv with NDVIa and explore the correlation between the two. Finally, we compare the corrected NDVIv of 2019 with the NDVIm of 2019 to verify the model we constructed.

Table 1 Study data information.

Figure 1 shows the spectral response function curves of different satellite sensors in the visible and near-infrared spectrum32. By comparison, it can be found that in the visible light band, the spectral response function of MODIS is narrower than AVHRR, and the spectral response function of AVHRR is narrower than VIRR. In the near-infrared band, MODIS still has the narrowest spectral response function, followed by VIRR, and AVHRR has the widest spectral response function. The channel, wavelength range, corresponding spectrum and sub-satellite resolution information of MODIS, AVHRR, and VIRR sensors are shown in Table 2.

Figure 1

Spectral response function curves of different satellite sensors in the visible and near-infrared spectrum29.

Table 2 Corresponding channels of the main NDVI data source sensors.

Method

The linear model is one form of machine learning model. It is relatively simple and easy to build, and it contains some of the important basic ideas of machine learning: many more powerful nonlinear models can be obtained by introducing a hierarchical structure or a high-dimensional mapping on top of a linear model. There are many forms of linear model, and linear regression is a common one. Linear regression tries to learn a linear model that predicts a real-valued output as accurately as possible. A linear model is set up on the data set, a loss function is defined, and the model parameters are determined by minimizing this loss function, which gives the model used for subsequent prediction. The general linear regression process is presented in Fig. 2.

Figure 2

Schematic diagram of the linear regression algorithm flow.
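The fitting step of this flow can be illustrated with a minimal sketch. It assumes ordinary least squares with mean squared error as the loss function, which is the usual choice for linear regression but is not stated explicitly in the text; the function names `fit_line` and `mse` and the sample values are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = k * x + m; returns (k, m)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Slope: covariance of x and y divided by the variance of x.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance, so the slope is undefined")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    k = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    m = y_bar - k * x_bar
    return k, m


def mse(x, y, k, m):
    """Mean squared error of the line y = k * x + m (assumed loss function)."""
    return mean((yi - (k * xi + m)) ** 2 for xi, yi in zip(x, y))


# Made-up NDVI values, for illustration only (not data from this study).
ndvi_a = [0.10, 0.25, 0.40, 0.55, 0.70]
ndvi_m = [1.1 * v - 0.03 for v in ndvi_a]

k, m = fit_line(ndvi_a, ndvi_m)
loss = mse(ndvi_a, ndvi_m, k, m)
```

With a single predictor the minimizer of the squared error has this closed form, so no iterative optimization is needed; an iterative optimizer would only become necessary for more complex models.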

The detailed procedure is as follows33:

(I) The data are standardized and preprocessed. The preprocessing includes data cleaning, screening, organization, etc., so that the data can be input into the machine learning model as feature variables.

(II) Different machine learning algorithms are trained on a separate data set, the best-performing one is selected, and a machine learning model is established for the normalized vegetation index product retrieved from the Fengyun satellite.

(III) The long-term normalized vegetation index series of the Fengyun satellite is verified and output.

Both AVHRR NDVI data and MODIS NDVI data are available for 2001–2005, so we used the data of these 5 years to compare NDVIa with NDVIm and explore the correlation between the two. Both VIRR NDVI data and AVHRR NDVI data are available for 2015, so we used the data of that year to compare NDVIv with NDVIa and explore the correlation between the two. Finally, we compared the corrected NDVIv of 2019 with the NDVIm of 2019 to verify the model we constructed.

The linear machine learning model is used to construct the optimal functional relationship between the NDVIa and the NDVIm. The formula is as presented in formula (1):

$${\text{Y}}_{{{\text{NDVIm}}}} = \left\{ {{\text{k}}_{2001} ,{\text{k}}_{2002} ,{\text{k}}_{2003} ,{\text{k}}_{2004} ,{\text{k}}_{2005} ,{\text{k}}_{{{\text{min}}}} ,{\text{k}}_{{{\text{max}}}} ,{\text{k}}_{{{\text{ave}}}} } \right\} \times {\text{X}}_{{{\text{NDVIa}}}} + \left\{ {{\text{m}}_{2001} ,{\text{m}}_{2002} ,{\text{m}}_{2003} ,{\text{m}}_{2004} ,{\text{m}}_{2005} ,{\text{m}}_{{{\text{min}}}} ,{\text{m}}_{{{\text{max}}}} ,{\text{m}}_{{{\text{mean}}}} } \right\}$$
(1)

In formula (1), XNDVIa is the NDVI value of AVHRR and YNDVIm is the NDVI value of MODIS. k is the coefficient of the linear relationship between NDVIa and NDVIm; k2001, k2002, k2003, k2004, k2005, kmin, kmax and kave are the coefficients for 2001, 2002, 2003, 2004 and 2005, the 5-year minimum, the 5-year maximum, and the 5-year average, respectively. m is the intercept of the linear relationship between NDVIa and NDVIm; m2001, m2002, m2003, m2004, m2005, mmin, mmax and mmean are the intercepts for 2001, 2002, 2003, 2004 and 2005, the 5-year minimum, the 5-year maximum, and the 5-year average, respectively.

Through multiple cross-comparison analysis, the optimal coefficient k and the optimal coefficient m are selected, and then the optimal functional relationship between NDVIa and NDVIm is determined.
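One way such a selection could be carried out is sketched below. The paper does not state the selection criterion, so this sketch assumes that each candidate pair (k, m) from formula (1) is scored by its root mean square error (RMSE) over pooled samples and that the pair with the lowest RMSE is kept. Pairing kmin with mmin (and likewise for max and mean) is also an assumption, and all names and numbers are hypothetical.

```python
from statistics import mean


def rmse(samples, k, m):
    """Root mean square error of y = k * x + m over (x, y) sample pairs."""
    return mean((y - (k * x + m)) ** 2 for x, y in samples) ** 0.5


def build_candidates(yearly):
    """Build the candidate set of formula (1) from per-year fits.

    yearly maps a year to its fitted (k, m). The min, max and mean
    candidates pair the corresponding k and m (an assumption).
    """
    ks = [k for k, _ in yearly.values()]
    ms = [m for _, m in yearly.values()]
    candidates = {str(year): km for year, km in yearly.items()}
    candidates["min"] = (min(ks), min(ms))
    candidates["max"] = (max(ks), max(ms))
    candidates["mean"] = (mean(ks), mean(ms))
    return candidates


def select_best(candidates, samples):
    """Return the label of the candidate with the lowest RMSE."""
    return min(candidates, key=lambda name: rmse(samples, *candidates[name]))


# Made-up per-year coefficients and samples, for illustration only.
yearly = {2001: (1.03, -0.05), 2002: (1.05, -0.04), 2003: (1.08, -0.02)}
samples = [(x, 1.05 * x + (-0.04)) for x in (0.1, 0.3, 0.5, 0.7)]

candidates = build_candidates(yearly)
best = select_best(candidates, samples)
k_opt, m_opt = candidates[best]
```

Other criteria, such as the highest correlation with independent MODIS data, would fit into the same structure by replacing `rmse` with a different scoring function.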

Based on the above analysis, we continue to construct the functional relationship between NDVIa and NDVIv, according to formula (2).

$${\text{X}}_{{{\text{NDVIa}}}} = {\text{aZ}}_{{{\text{NDVIv}}}} + {\text{b}}{.}$$
(2)

In the formula (2), ZNDVIv is the NDVI value of VIRR, XNDVIa is the NDVI value of AVHRR, a is the coefficient value of the linear function relationship between the NDVIv and the NDVIa fitting, and b is the intercept of the linear function relationship between NDVIv and NDVIa fitting.

Substituting the functional relationship between NDVIa and NDVIv (formula (2)) into the selected optimal relationship between NDVIa and NDVIm gives the corrected NDVIv, which is CNDVIcv in formula (3):

$${\text{C}}_{{{\text{NDVIcv}}}} = {\text{kX}}_{{{\text{NDVIa}}}} + {\text{m}} = {\text{k}}\left( {{\text{aZ}}_{{{\text{NDVIv}}}} + {\text{b}}} \right) + {\text{m}} = {\text{kaZ}}_{{{\text{NDVIv}}}} + {\text{kb}} + {\text{m}}{.}$$
(3)

In the formula, CNDVIcv is the corrected NDVIv(NDVIcv), k is the optimal coefficient of the correlation between NDVIa and NDVIm, and m is the optimal intercept of the correlation between NDVIa and NDVIm.
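The substitution in formula (3) can be checked with a short sketch: applying formula (2) and then the optimal form of formula (1) gives the same result as a single line with slope ka and intercept kb + m. The coefficient values and function names below are hypothetical and chosen only for illustration.

```python
def compose_correction(k, m, a, b):
    """Collapse formulas (2) and (1) into one slope and intercept, as in formula (3)."""
    return k * a, k * b + m


def correct_ndvi_v(z, k, m, a, b):
    """Apply the composed correction to a list of VIRR NDVI values."""
    slope, intercept = compose_correction(k, m, a, b)
    return [slope * zi + intercept for zi in z]


# Hypothetical coefficients, not values from this study.
k, m = 1.10, -0.03   # NDVIa -> NDVIm, formula (1)
a, b = 1.20, 0.02    # NDVIv -> NDVIa, formula (2)

ndvi_v = [0.05, 0.20, 0.45, 0.60]                   # made-up VIRR values
two_step = [k * (a * z + b) + m for z in ndvi_v]    # formula (2), then (1)
one_step = correct_ndvi_v(ndvi_v, k, m, a, b)       # formula (3)
```

Composing the two fits in advance means that only one slope and one intercept need to be stored and applied to each VIRR pixel.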

The data of 2005 were selected to compare NDVIm and NDVIa in some parts of China and surrounding areas. The data of 2015 were selected to compare NDVIv and NDVIa in some parts of China and surrounding areas. Through analysis, the correlation among NDVIv, NDVIa and NDVIm is found.

Results

Comparison of NDVI between AVHRR and MODIS in parts of China and surrounding areas

Comparing NDVIa and NDVIm in parts of China and surrounding areas (Table 3) shows that the correlation coefficient for January, April, July, and October of 2005 was between 0.8652 and 0.9348, and the coefficient of determination was between 0.7024 and 0.8519. The correlations are significant at the 0.01 or 0.05 level, indicating that NDVIa and NDVIm are well correlated. Overall, in 2005 NDVIa was larger than NDVIm in January and smaller than NDVIm in the other months. As can be seen from Fig. 3, however, in April and October NDVIa is greater than NDVIm in areas with lower NDVI values.

Table 3 Comparison results of NDVIa and NDVIm.
Figure 3

NDVI values monitored by AVHRR and MODIS in parts of China and surrounding areas in January, April, July and October 2005 (a1 is the NDVIa of January, a2 is the NDVIm of January; b1 is the NDVIa of April, b2 is the NDVIm of April; c1 is the NDVIa of July, c2 is the NDVIm of July; d1 is the NDVIa of October, d2 is the NDVIm of October) (Made by: Python 3.8.3 https://www.python.org/downloads/release/python-383/).
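The comparison statistics used in this section can be sketched as follows. This is a minimal illustration on made-up pixel values, not the processing code of this study; the function names `pearson_r` and `mean_bias` are hypothetical. The mean bias is added here as one simple way to express which product is larger on average.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


def mean_bias(x, y):
    """Mean of x - y; negative when x is on average smaller than y."""
    return mean(xi - yi for xi, yi in zip(x, y))


# Made-up co-located pixel values, for illustration only.
ndvi_a = [0.08, 0.21, 0.35, 0.52, 0.66]
ndvi_m = [0.10, 0.26, 0.40, 0.60, 0.71]

r = pearson_r(ndvi_a, ndvi_m)
r2 = r ** 2                       # R2 of an ordinary least squares line with intercept
bias = mean_bias(ndvi_a, ndvi_m)  # negative here: NDVIa below NDVIm on average
```

The significance level of the correlation is not computed in this sketch; it would typically be obtained from a statistics library, for example `scipy.stats.pearsonr`, which returns the correlation coefficient together with its p value.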

Comparison of NDVI between VIRR and AVHRR in parts of China and surrounding areas

Comparing NDVIv and NDVIa in parts of China and surrounding areas (Table 4) shows that the correlation coefficient for January, April, July, and October of 2015 was between 0.7238 and 0.8929, and the coefficient of determination was between 0.6072 and 0.8299. The correlations are significant at the 0.01 or 0.05 level, indicating a significant correlation between NDVIv and NDVIa, so NDVIa can be used to correct NDVIv. On the whole, NDVIv in January, April, July, and October of 2015 is smaller than NDVIa. However, Fig. 4 shows that in April NDVIv is greater than NDVIa in areas with lower NDVI values.

Table 4 Comparison results of NDVIv and NDVIa in 2015.
Figure 4

The NDVI values monitored by VIRR and AVHRR in parts of China and surrounding areas in January, April, July and October 2015 (a1 is the NDVIv of January, a2 is the NDVIa of January; b1 is the NDVIv of April, b2 is the NDVIa of April; c1 is the NDVIv of July, c2 is the NDVIa of July; d1 is the NDVIv of October, d2 is the NDVIa of October) (Made by: Python 3.8.3 https://www.python.org/downloads/release/python-383/).

The two comparisons above show that there is a significant correlation between NDVIa and NDVIm, with NDVIa generally smaller than NDVIm, and a significant correlation between NDVIv and NDVIa, with NDVIv generally smaller than NDVIa. We therefore deduce the following overall relationship: NDVIv < NDVIa < NDVIm.

Construction of NDVI correction algorithm based on linear machine learning model

We construct a linear machine learning model following the procedure described in the Method section.

To widen the choice of candidate linear models and improve the accuracy of the correction, we compared NDVIa and NDVIm for January, April, July, and October of each year from 2001 to 2005. The results are as follows.

Comparing NDVIa and NDVIm in January (Table 5) shows that, from 2001 to 2005, the coefficient ranged from 1.0358 to 1.0679 (average 1.0553) and the intercept from − 0.0564 to − 0.0285 (average − 0.0409). The correlation coefficient r was 0.8638–0.8768 (average 0.8693), the coefficient of determination R2 was 0.7024–0.7662 (average 0.7400), and the p value was 0.0133–0.0175 (average 0.0149), significant at the 0.05 level in every year.

Table 5 Comparison and analysis of NDVIa and NDVIm in January from 2001 to 2005.

Comparing NDVIa and NDVIm in April (Table 6) shows that, from 2001 to 2005, the coefficient ranged from 1.1504 to 1.1823 (average 1.1637) and the intercept from − 0.0415 to − 0.0272 (average − 0.0332). The correlation coefficient r was 0.9070–0.9137 (average 0.9102), the coefficient of determination R2 was 0.8070–0.8362 (average 0.8184), and the p value was 0.0069–0.0081 (average 0.0076), significant at the 0.01 level in every year.

Table 6 Comparison and analysis of NDVIa and NDVIm in April from 2001 to 2005.

Comparing NDVIa and NDVIm in July (Table 7) shows that, from 2001 to 2005, the coefficient ranged from 1.0928 to 1.1191 (average 1.1026) and the intercept from 0.0229 to 0.0382 (average 0.0301). The correlation coefficient r was 0.9341–0.9395 (average 0.9370), the coefficient of determination R2 was 0.7741–0.8008 (average 0.7870), and the p value was 0.0149–0.0173 (average 0.0160), significant at the 0.05 level in every year.

Table 7 Comparison and analysis of NDVIa and NDVIm in July from 2001 to 2005.

Comparing NDVIa and NDVIm in October (Table 8) shows that, from 2001 to 2005, the coefficient ranged from 1.1349 to 1.1809 (average 1.1523) and the intercept from − 0.0330 to − 0.0113 (average − 0.0189). The correlation coefficient r was 0.8903–0.9048 (average 0.8985), the coefficient of determination R2 was 0.8521–0.8777 (average 0.8619), and the p value was 0.0071–0.0088 (average 0.0081), significant at the 0.01 level in every year.

Table 8 Comparison and analysis of NDVIa and NDVIm in October from 2001 to 2005.

Comparative analysis of the revised NDVIv and NDVIm

Using the procedure of the Method section and the VIRR and AVHRR comparison data described above, we constructed a linear machine learning model, found the best fitting function for the corrected NDVIv, and corrected the NDVIv values in parts of China and surrounding areas for January, April, July, and October. The corrected values were then compared with NDVIm in the same area. The fitted model and the comparison are shown in Fig. 5 and Table 9.

Figure 5

The revised NDVI value comparison between VIRR and MODIS in January, April, July, and October of 2019 in parts of China and surrounding areas (Jan-a is the revised NDVIv of January, Jan-b is the NDVIm of January; Apr-a is the revised NDVIv of April, Apr-b is the NDVIm of April; Jul-a is the revised NDVIv of July, Jul-b is the NDVIm of July; Oct-a is the revised NDVIv of October, and Oct-b is the NDVIm of October) (Made by: Python 3.8.3 https://www.python.org/downloads/release/python-383/).

Table 9 The optimal function construction of NDVIv fitting and its comparison with NDVIm.

After correction, NDVIv was compared with NDVIm (Table 9). The correlation coefficient rose from 0.7238–0.8929 before correction to 0.9126–0.9445 after correction, and the coefficient of determination rose from 0.6072–0.8299 to 0.9002–0.9326. The significance also improved from the 0.05 or 0.01 level to the 0.01 level or better, in some cases reaching the 0.001 level. Fig. 5 likewise shows that the corrected NDVIv is substantially more consistent with NDVIm. These results indicate that our correction method is feasible.

Conclusion and discussion

This study found a significant correlation between NDVIa and NDVIm, and also between NDVIv and NDVIa. The relationship between the three is NDVIv < NDVIa < NDVIm.

The NDVIv corrected with the linear machine learning model is in good agreement with NDVIm. The correlation coefficient improved from 0.7238–0.8929 before correction to 0.9126–0.9445 after correction, and the coefficient of determination improved from 0.6072–0.8299 to 0.9002–0.9326; all correlations after correction are significant at the 0.01 level. The corrected NDVIv is therefore considerably closer to NDVIm, which improves the accuracy and quality of the product.

In addition, we have the following thoughts. Firstly, in this study, we use linear machine learning models to correct the NDVIv in parts of China and surrounding areas. In some areas, the NDVIv and the NDVIm may not have a linear relationship, and it is likely to be a non-linear relationship, so in future research, different machine learning models should be used to correct Fengyun satellite products, such as decision trees, neural networks, and support vector machines. At the same time, we should construct regional machine learning models for parts of China and surrounding areas according to different terrains, different meteorological conditions, and different atmospheric conditions, and use different machine learning methods for different regions to correct Fengyun Satellite products. This may be able to more objectively and accurately correct the NDVIv value to a level equivalent to that of NDVIm.

Secondly, we are currently correcting the Fengyun satellite data at the product level. Although the corrected product is closer to the MODIS product value, the correction is applied to the product itself and its underlying mechanism is not very strong. In the future, we will consider revising the Fengyun Satellite’s NDVI products at the band level, by comparing and analyzing the Fengyun Satellite’s infrared band with the MODIS infrared band and its near-infrared band with the MODIS near-infrared band. This may bring the Fengyun Satellite’s NDVI products to a level closer to that of NDVIm. It may also reveal the reasons for the inaccuracy of the Fengyun Satellite retrieval values, thereby improving the Fengyun Satellite’s own monitoring accuracy.

Thirdly, studies have shown that there is still a certain gap between NDVIm or NDVIa and the NDVI values observed on the ground34,35. It is therefore also unreasonable to treat NDVIm as the true value of NDVI. Future work should compare the Fengyun Satellite’s NDVI product with NDVI values retrieved from ground observation data, which would be a meaningful way to further improve the accuracy of the Fengyun Satellite’s NDVI product.

The products of the Fengyun Satellite are not yet widely applied, and one of the most important reasons is that its observed values differ greatly from the actual values. The NDVI observed by MODIS is internationally recognized as relatively accurate. One of the main aims of this study was therefore to build a model that calibrates NDVIv to the level of NDVIm, and this aim has been achieved. In addition, this study aims to form a long time series of NDVI observed jointly by MODIS, AVHRR and VIRR, so that data from the other instruments can fill in when the data of one instrument are missing. This study shows that the NDVI of the VIRR instrument on China's Fengyun Satellite can be used, after correction, when MODIS or AVHRR data are missing.