Larger-scale ocean-atmospheric patterns drive synergistic variability and world-wide volatility of wheat yields

Diagnosing potential predictability of global crop yields in the near term is of utmost importance for ensuring food supply and preventing socio-economic consequences. Previous studies suggest that a substantial proportion of global wheat yield variability depends on local climate and larger-scale ocean-atmospheric patterns. The science is however at its infancy to address whether synergistic variability and volatility (major departure from the normal) of multi-national crop yields can be potentially predicted by larger-scale climate drivers. Here, using observed data on wheat yields for 85 producing countries and climate variability from 1961–2013, we diagnose that wheat yields vary synergistically across key producing nations and can also be concurrently volatile, as a function of shared larger-scale climate drivers. We use a statistical approach called robust Principal Component Analysis (rPCA), to decouple and quantify the leading modes (PC) of global wheat yield variability where the top four PCs explain nearly 33% of the total variance. Diagnostics of PC1 indicate previous year’s local Air Temperature variability being the primary influence and the tropical Pacific Ocean being the most dominating larger-scale climate stimulus. Results also demonstrate that world-wide yield volatility has become more common in the current most decades, associating with warmer northern Pacific and Atlantic oceans, leading mostly to global supply shortages. As the world warms and extreme weather events become more common, this diagnostic analysis provides convincing evidence that concurrent variability and world-wide volatility of wheat yields can potentially be predicted, which has major socio-economic and commercial importance at the global scale, underscoring the urgency of common options in managing climate risk.

Due to technological improvements over time, each yield time series was expected to have a monotonic trend component. We were primarily interested in persistent yield variation and therefore we de-trended each time series using a standardization approach adopted in earlier research 1 (Equation 1). Yt = [yt -mean(yt-3,t+3)] / SD (yt-3,t+3) ………………………………Equation 1 In Eq. (1), Yt is the detrended yield value at year t, yt is the original yield value at year t, mean(yt- 3,t+3) is the mean of the original yield values for the seven-year moving window centered around year t, and SD(yt−3,t+3) is the standard deviation of the original yield values for the seven-year window centered around year t. The final yield dataset contained standardized wheat yield anomalies for 85 countries from 1964-2010 where 1964 contains the yield value of that year standardized with respect to the annual yields recorded within 1961-1967, and so on. These 85 countries accounted for about 83% of global wheat production (as per 2013 statistics). It should be noted throughout within the manuscript that all the reported yield numbers are based on 85 countries studied and missing countries include some of the major wheat producers such as Russia, Ukraine and Kazakhstan. As such, global wheat cropland area refers to cropland areas of wheat within the 85 countries. More discrepancy in the data used is mentioned in SM14.

SM3. Robust Principal Component Analysis (rPCA) method 2 :
When a dataset, such as global crop yields (85 countries and 47 years), has such a higher dimension, reducing its feature is crucial 3 to meeting our overarching goal. Principal Component Analysis (PCA) is a useful approach for dimensionality reduction in earth sciences that enables better diagnostics of the structure of the global data at multi-dimensional space and time 4 . When there are outliers in a sample dataset, such as crop yields, basic statistical methods, such as the PCA, may produce unreliable results. However, a robust version of the same method (rPCA) minimizes that common issue by recognizing abnormal observations first. In this study, we employ rPCA method in order to elucidate patterns inherent in the global year-to-year yield variability and find countries whose inter-annual yield variability tend to be more similar/opposite, as well as separating out the outlier yield estimates. The latter especially is a special outcome from the rPCA 5 . rPCA extracts orthogonal modes of year-to-year variability of crop yields (or concurrent variability of multiple national yields) from the high spatial and temporal variance in global yield dataset (in principal components/PCs) after skimming out the multi-variate outliers from the dataset and storing them in a sparse matrix.
In summary, rPCA decomposes a rectangular matrix M into a low-rank component matrix (L), and a sparse component matrix (S), by solving a convex optimization program called Principal Component Pursuit 70 . The schematic input and output data acquired from rPCA is presented in Figure S2. The S-matrix contains all the simultaneous yield outliers across all the countries in every year and L-matrix exhibits "new" interannual standardized yield anomalies for all the 85 countries. The approach to compute the sparse matrix (S) also minimizes the rank of the original data matrix (M). Therefore, the number of principal components would be a lesser number than an ordinary PCA applied on M-matrix and hence the total number of the PCs in our case is 26 ( Figure S3). Figure S2, M is the standardized yield matrix (as in section SM3) where the low rank matrix (L) delivers standardized annual yield variability after removing yield outliers from M and storing in S. The L matrix have a number of vectors, specific to each country yield, on which the PCA method is run. We used the first four PC outputs to study characteristics of concurrent yields and their connections to climate variability while the error bars of the eigenvalues by Leave-1 out cross validation bootstraps for all the first 10 PCs are shown in Figure S5. It is important to note here that, fourth PC is not entirely separable from the others, and therefore cannot be interpreted physically apart from the following PCs.

SM4. Method to investigate leading modes of world-wide wheat yield variability: Following
SM5. Countries jointly contributing to each PC: Figure 2 indicated that countries with similar directional variability in annual yields (positively covarying nations that are also having high loading values of the same sign) are assigned with the same color while those co-varying but in a different manner (high loading values of opposite signs) are indicated by the opposite colors. The histograms of all the loading values (assigned to 85 nations) corresponding to the first 4 PCs are shown in Figure S6. The vertical blue and orange lines are based on statistical mean ± one standard deviation values of 85 loading values. Countries whose loading values are greater (absolute) than these standard deviation values are highlighted by the orange or blue colors indicating highly co-varying countries (Table S1, Figure 2 (Table S1 provides more details).

SM6. Data for air temperature anomaly (ATa):
The importance of air temperature on wheat yield has been discussed in numerous studies [6][7][8][9][10][11] . Here, we used monthly air temperature anomalies data (ATa) from 1963-2010 from University of Delaware Air Temperature database at 0.5-degree resolution. We downloaded the data from NOAA/OAR/ESRL PSD, Boulder, CO, USA, from their web site at https://www.esrl.noaa.gov/psd/ for the wheat cropland grid cells ( Figure S12).

SM7. Data for Palmer Drought Severity Index (PDSI):
PDSI was developed by Palmer 12 to measure the cumulative departure in surface water balance. Here, we evaluated correlations between local PDSI variability and the PCs. PDSI is highly correlated with soil moisture content 13 , and an important factor influencing both rainfed and irrigated crop yields 14 . PDSI has a range between -10 (driest) to +10 (wettest) with values below -3 representing severe to extreme drought. The index has been widely used in studies assessing the impact of droughts on wheat [15][16][17] . We used gridded monthly self-calibrated PDSI data 18 , at 2.5-degree resolution, from 1963-2010, obtained from NOAA/OAR/ESRL-PSD (http://www.esrl.noaa.gov/psd/), Boulder, CO, USA.

SM8. Data for Sea Surface Temperature anomalies (SSTa):
SST is an essential parameter in weather prediction and atmospheric model simulations. We used monthly Extended Reconstructed Sea Surface Temperature (ERSST) version 4 dataset from 1963-2013, on global 2 x 2 grids downloaded from NOAA/OAR/ESRL PSD, Boulder, CO, USA, from their web site (https://www.esrl.noaa.gov/psd/).

SM9. Data for geopotential height (GPH):
Geopotential height approximates heights of the pressure level in the upper atmosphere and its fluctuation drives the atmospheric circulation patterns. We have used GPH at 500 hPa pressure level (Z500), which is the most important variable describing the larger-scale air flows 19 . Detrended gridded monthly GPH data, at 2.5degree resolution, was obtained from the same source as PDSI.

SM10. Data for larger-scale ocean-atmospheric indices:
Ocean-atmospheric indices are used to characterize aspects of geophysical systems such as ocean-atmospheric circulation patterns that can trigger concurrent multiple regional climate variability across the globe (https://climatedataguide.ucar.edu/climate-data/overview-climate-indices) 20 . We obtained a range of monthly atmospheric and oceanic indices in time series format, downloaded from https://www.esrl.noaa.gov/psd/data/climateindices/list/. We categorized all these indices based

SM11. Method to identify the major wheat producing-, exporting and importing-countries:
We identified 20 major wheat producers, exporters and importers among 85 countries (Table S1) based on average annual wheat production, export and import quantities within the most recent decade (2001 -2013). Like yields, we obtained all these data from FAOSTAT (http://www.fao.org/faostat/en/#data).

SM12. Method to define Most yield-Volatile Year (MVY) and Least yield-Volatile Year (LVY):
A major assumption in standard statistical methods is that the data follows a known statistical distribution 30 , but it is crucial to recognize the impacts of outliers on the outputs. Annual crop yield data recorded for the global countries can be influenced by many factors and among them climate is the chief, in general 31 . Subsequently, extreme climatic events, such as floods and droughts can reduce crop yields, which may not fall into a normal statistical distribution. Hence, it is likely that some unusual yield values (both low or high with respect to the mean within recent 7-year distribution) appearing across countries. We identified yield volatile years and countries contributing consistently to world-wide yield volatility such that we can investigate if such volatility is systematic having connections to climate (especially extremes).
There is rich body of literature relating to detection of outliers in various fields 4,5,30-32 , background modeling from surveillance video is a well-known example. Despite having issues regarding outliers, as we mentioned above, to our knowledge, there hasn't been any effort to diagnose anomalous crop yield years. While outliers can be worrisome for some datasets, we realized that, sparse matrix (S-matrix) containing crop yield outliers, within multi-dimensional space, can provide relevant information to investigate global food supply volatility and losses because there may be a good reason for the co-incidence of extreme yield volatility, or not, within many countries. As such, we examined years when there were 10 largest and 10 smallest co-incidence of extreme yield departures (number of countries) from the global mean. To do that we used rPCA outputs stored in S-matrix (refer to SM3 & SM4), and defined the most yieldvolatile years (MVY) as the years when there were top ten highest co-incidence of countries having significantly greater absolute yields in S-matrix (absolute national standardized yield larger than 1 standard deviation estimate of all the 85 national yields within the same year, Table  S4). On the other hand, a least yield-volatile year (LVY) was characterized as the year when there were smallest number of nations having coinciding anomalous yields (Table S5). We eventually used these extreme 10 + 10 years from Tables S4 and S5 respectively to make composite maps of PDSI, ATa, and SSTa.
Here is a clear step-by-step process in detail to choose each MVY and LVY: 1. We consider sparse matrix (S in Figure S2 3. Then we sort years based on the largest and the smallest standard deviation of yearly distributions in 2 where the widest/narrowest distributions defining MVY/LVY. 4. Then we remove countries whose absolute sparse value magnitudes in a year are less than the standard deviation of the corresponding year (Step 2) and hence the tails of yearly sparse yield distributions designate countries having extreme yield magnitudes in those years. We show the final results in the Tables S4 and S5. 5. Countries those repeatedly appeared in volatile years (step 3) are designated as most and least volatile countries, respectively, and are shown by thicker boundaries in the Figures 3 and 4.

SM13. Data on global wheat croplands coverage:
In order to get the total coverage of wheat croplands we combined both irrigated (Figure S12-a) and rainfed grids ( Figure S12-b) 33 . We used the combined map ( Figure S12-c) to study local climate variability. We re-gridded every local climate data (PDSI and ATa) to match spatial scales of Figure S12-c.
SM14. Limitations to include wheat growing season in climate analysis: Wheat is not grown at fixed calendar dates and growing durations vary across the world 34 . Our analysis incorporated annual average climate variability, both at local as well as larger-scales, as opposed to locallyconstrained "growing season", due to the following reasons: First, the global yield dataset was complete and consistent for 85 countries from 1961-2013, on the annual time scale and national spatial scales. The same dataset did not separate out spring and winter wheat yields at the desired time span and/or spatial conditions. There was one higher (spatial) dimension yield dataset available, but for 1961-2008 with highest spatial dimension possible only for 17 countries in total, and so was not useful for our global assessment 35 . The data developers 35 also admitted that the frequency of the data reporting varied among countries and there were no separations in seasonal yields. Therefore, to confidently meet our research objectives, i.e. investigating co-variability in multi-national yields (and coinciding volatility), as many climates concurrently vary across the globe as a function of global climatic drivers, the best yield dataset we could possibly use was the FAO data at annual and national averages, which was complete and consistent for 85 nations over 1961-2013.
Second, we have used grids demonstrating combined irrigated and rainfed croplands across the globe ( Figure S12) but did not use growing season dates specific for those grids. We obtained data on seasonal cropping calendars for both spring and winter wheat varieties from Sacks et al. 34 but we were unable to use that information for our climate analysis due to the below limitations: 1. SAGE dataset (http://www.sage.wisc.edu) provides mean growing season information for winter and spring wheat varieties (average planting and harvesting dates), focusing mainly on rainfed croplands 34 ( Figure S13). Hence, this dataset missed out on many of the crucial croplands we have included in our study. Figure S12(c) indicates those included in our study (combined irrigated and rainfed croplands) across the globe while Figure S13 provides growing season information from the SAGE dataset. We particularly see that our cropland coverage in Figure  S12 provides a much higher resolution and extent for local climate analysis, where a range of crucial wheat growing areas in south Asia (e.g. India, China) and eastern Europe are not covered in SAGE ( Figure S13). We already faced earlier limitations due to yield data (un)availability within desired specifications and time frames, where Russia was excluded from our analysis due to Soviet Union breakdown issue. It was hence imperative not to lose more important producers from our analysis due to unavailability in growing season data. Inclusion of SAGE growing season dataset would only lead to exclusion of more countries in the eastern Europe and India from climate analysis, countries which were important for the PCs.
2. As evident from Figures S12 and S13, SAGE crop calendar observations generally applied to larger geographic regions, where, most observations were specified either for an entire country or for a fairly large sub-national units (e.g. at state-level within the USA). Countries, such as the USA, European region, Turkey, and China produces both winter and spring wheat ( Figure S13), but yield data for these countries were mostly available as annual averages. In such a scenario, deciding on which growing season to consider for local climate analysis was difficult. Furthermore, for larger-scale climate analysis to discover the common influence, we could only consider "one" consistent time scale. As a result, we ended up selecting "annual" time scale as the one and only that would be coherent across yield and climate variability. By incorporating both concurrent and previous year's climate variability we could possibly capture variable starting / harvesting dates ( Figure S13) of local growing seasons to some extent. This, in our view, made sense, as Sacks et al. 50 indicated that wheat planting dates can be highly determined by a region's climate variability that is variable within and across countries.
3. SAGE does not capture any seasonal timing changes/shifts or variability within a given region. In reality, planting dates vary through space and time based on changes in the weather and climate (e.g. north-south gradient following temperature gradient), and also due to non-climatic factors such as soil properties, cultivar choice, farm management, changes in technological and socio-economic factors 34 .
4. Finally, Sacks et al. 34 also mentioned that there may be some observations within SAGE dataset, where the growing season observations lacked an explicit label of "winter wheat" or "spring wheat". They also acknowledged that in tropical and subtropical regions this scheme could lead to mis-classification of spring vs winter cereals, which could be solved if a minimum temperature threshold was included to distinguish between either type, as winter variety requires cold temperatures for vernalization; however, the data developers 34 were not able to identify a robust threshold due to the reason that vernalization requirements differ between cultivars. As a result, Sacks et al. 34 made a strong cautionary note to the data users that "..our data set should not be used to determine which regions actually grow winter versus spring cereals.        Figure 2, countries whose loading values are greater (or smaller) than this one standard deviation values are highlighted by the orange or blue color boundaries, indicating highly co-varying countries in yields (names are indicated in Table S1).  Fig S2). Countries that are not within the scope of this study are highlighted by gray color .   Fig. S8. Country-specific growing area (in percentage) influenced by the local climate variability (concurrent-year ATa, lagged-ATa, concurrent-year PDSI, and lagged-PDSI variability), as exhibited by significant Spearman rank correlations at 95% level between local climate indicators and leading PCs (as displayed in Figure 2). Only PCon# are marked here.      Tables   Table S1. Countries corresponding to each PC (orange and blue highlights in Figure 2), and their ranks as producer, exporter or importers (e.g. 17e for Austria indicates country rank 17 as an exporter, method to choose country ranks is discussed in section SM11). Countries showing significant local climate influence in Figure 2 (correlations between PCs and local climate indicators) is denoted here by a star within the orange & blue highlights. Table S1. Countries contributing to each PC (orange and blue highlights), and their ranks as producing-, exporting-or importing-country (e.g. 17e indicates rank 17 as an exporter). Countries showing any degree of local climate influence is indicated by an additional star with orange and blue highlights.  Table S2. Spearman rank correlation coefficients between each PC time series and standardized national yields in L matrix (also shown on a map in Figure S7). Correlations (r), once squared (r 2 ), indicate to the extent by which each PC explains national yield variance or vice versa. For example: the global variability captured within PC1 (10.5%) associates highly with Austria but the same with the other 3 PCs is minimal. On the other hand, PC1 illustrates a higher correlation with Bulgaria's yields (0.74) while PC4 indicating an opposite type of influence.

Table S2. Correlations between each PC time series and standardized national yields, after removing outliers by rPCA (low ranked matrix). Correlations indicate magnitudes of association between each national interannual yield and a specific mode of global yield variability (PC).
For example: the global variability captured within PC1 (10.5%) associates highly with Austria, with a correlation coefficient of 0.71 while this country has a lesser degree of association within the other 3 top PCs. On the other hand, for Bulgaria, PC1 is most important while PC3 the least. For both France and Germany it is both PC1 and PC4 that are important while PC4 exerts lesser degree of influence than PC1.      Table S5. World-wide least yield volatile years within 1964-2010 and countries contributing to their volatility. Years when 10 smallest co-incidence of extreme yield departures (countries) happened from the global mean indicating least yield-volatile years (MVY) (absolute national yields greater than 1 standard deviation estimate of all the 85 national yields within the same year). These least extreme 10 years are also used to make composite maps of PDSI, ATa, SSTa, and GPH - Figures 3 and 4.  Table S6. A list with hPCon# countries in which a 1-year lagged local climate indicator is relevant ( Figure 2) due to crop calendar ( Figure S13). Crop calendar data is bi-weekly, from source: https://asheshwor.shinyapps.io/cropcal/ based on SAGE data 50 on country basis (https://nelson.wisc.edu/sage/data-and-models/crop-calendar-dataset/index.php).

Country
Figure S1