Introduction

Economic theory holds that asset returns are functions of the state variables of the real economy and that the real economy displays business-cycle fluctuations. If the quantity and price of aggregate risk are linked to economic fluctuations, one can expect return predictability to exist (Campbell and Shiller, 1988; Fama and French, 1989; Campbell and Cochrane, 1999). However, in practice, forecasting stock returns remains notoriously difficult, although many economic variables have been developed for this purpose in the literature. The predictors examined include dividend–price ratio, dividend yield, earnings–price ratio, dividend–payout ratio (Campbell and Shiller, 1988), stock volatility (Guo, 2006), book-to-market ratio (Kothari and Shanken, 1997; Pontiff and Schall, 1998), T-Bill rate (Campbell, 1987), term spread (Fama and French, 1989), and inflation rate (Fama and Schwert, 1977). Welch and Goyal (2008) systematically examine the forecasting performance of popular economic indicators in published papers and find no evidence that any of them can significantly beat the no-predictability model (historical average). Goyal et al. (2021) further confirm this finding based on the reexamination of the forecasting performance using new predictors developed in recent literature.

Apart from challenging modeling issues, there are three reasons for the failure of macroeconomic data to forecast stock returns. First, macroeconomic indicators such as the consumer price index (CPI) and gross domestic product (GDP) are always published with delays, even if the government data are credible. Therefore, real-time predictions of future returns must rely on lagged macroeconomic data instead of current-period data, which undermines forecasting performance. Second, many macroeconomic indicators undergo revisions after initial publication. Using unrevised data in real time biases the forecasting results. Third, traditional economic data are publicly available, and there is virtually no cost apart from basic data processing equipment to acquire the information. When this new information enters the market, it is instantaneously integrated into the price if the number of investors trading on it is sufficient (Jensen, 1978). As a result, the predictive value of announcement-based economic data is extraordinarily short-lived due to its public accessibility. In addition, macroeconomic data are published at monthly or quarterly frequencies, making it difficult to forecast stock returns at daily or weekly frequencies. Therefore, high-frequency data that contain economic information not easily available through simple searches in real time can be expected to improve the forecasting of stock returns.

Recent technological advances in artificial intelligence and big data have revolutionized information collection. Hedge funds have turned to satellite technology to gain a real-time information advantage in understanding economic activity. They use satellite imagery to obtain information on mines, ports, plantations, or farmland before making investments. Several recently established companies provide satellite-based forecasts of economic indicators. For example, Orbital Insight, Planet Labs, Spire Global, and Space Know track industrial facilities, real estate properties, foot traffic activity, oil refineries, petrochemical plants, and auto manufacturing centers to generate information about commercial properties. In 2016, 70 of the 74 clients of Orbital Insight, the biggest geospatial analytics company, were hedge funds (see https://orbitalinsight.com). Institutional investors increasingly attempt to glean investment insights from such imagery.

We use information from satellite images of container ports to predict global stock market returns. Economic globalization depends on the rapid and efficient movement of goods via containerization. Globally, 90% of non-bulk dry cargo is now shipped by container. The number of containers at a port can be regarded as an indicator of macroeconomic information. The rationale is that an increase in the number of containers stacked in the port implies decreased demand for shipping service, and thus predicts lower economic activity. Stock prices are expected to fall accordingly. We extract real-time information from the Sentinel-2 images for the top 48 container ports (as ranked by throughput) from the European Space Agency (ESA).

Our database contains 83,672 temporally and spatially matched multispectral daytime 10 m/pixel images from Sentinel-2 satellites. The database covers the period from January 1, 2017, to November 1, 2021. To segment the container areas, we train the U-Net model (Ronneberger et al., 2015) with 3711 hand-labeled images in 2017. In this way, we can obtain a series of container coverage areas at a daily frequency for each of the 48 ports under consideration. The container area series are used to predict daily stock returns on 33 stock indices in major countries. Our results indicate that the combined container area information reveals significant return predictability for most of the 33 markets for the period from 2019 to 2021. Investment strategies based on the container information generate an economically considerable profit, with an annualized return of 16.38% and a Sharpe ratio of 1.19. Extended analysis shows the close links between the predictive power of satellite-based data and the macroeconomy.

The ability of our satellite-based container information to predict stock returns can be explained by its ability to anticipate economic activity. Global marine trade links production activity and the consumption of goods and is thus highly informative about economic activity. We investigate the predictive ability of the number of containers in relation to the growth of industrial production. Our results suggest a negative predictive relationship in 27 out of 28 countries at the horizon of four months, and this predictive ability is statistically significant in 15 cases. The significantly negative predictive relationship is also found when regressing the world average growth rate of industrial production on the past change in container numbers. The number of containers has a greater ability to predict industrial production during the COVID-19 period, echoing the stronger return predictability after the pandemic.

We compare the information content of our satellite-based container number data with popular shipping data, such as the freight rates indicator (Kilian, 2009) and container throughput (Döhrn and Maatsch, 2012; Döhrn, 2019; Kilian et al., 2021). We find that the global number of containers significantly predicts these two shipping indicators. Our data lead the traditional indicators at a horizon of 2 months. The strong relationship between our container indicator and existing indicators is not affected by the COVID-19 pandemic.

The advantages of satellite imagery as a data source for economic studies have been documented in the literature. Several studies use satellite image data to measure economic variables including GDP growth (Henderson et al., 2012), economic inequity (Chen and Nordhaus, 2011), income distribution (Mirza et al., 2021), sustainable development (Burke et al., 2021), and rural household poverty (Jean et al., 2016; Watmough et al., 2019). Most of these studies use satellite night-light data, which make it possible to compare economic activity across different areas. For a single area, however, night-light data show little variation over time and are thus not appropriate for time series prediction analysis. Recent studies reveal that satellite imagery of parking lots from Orbital Insight can anticipate retailer sales performance that is not yet announced and mute price reactions to earnings announcements (Katona et al., 2018; Zhu, 2019). Unlike these studies, which rely on ready-made forecasts from commercial satellite companies, we build our own database from public satellite data. This is an important difference because the high cost of commercial satellite data makes them inaccessible to many retail investors. The return predictability based on public satellite data is useful to many more market participants and has stronger economic implications.

Satellite data have also been applied in a small number of finance studies. Katona et al. (2018) use parking lot traffic signals extracted from satellite imagery and find that unequal access to satellite data increases information asymmetry among market participants. Zhu (2019) finds that satellite-based estimates of normalized car counts in parking lots of retailers predict earnings that are not yet announced. Mukherjee et al. (2021) show that after the introduction of satellite-based imagery data, oil price responses to government announcements of oil inventory are smoother. As a contribution, we directly show that the satellite-based estimates of the number of containers predict world stock returns, enriching the literature on the application of satellite data in empirical asset pricing studies.

The remainder of this paper is organized as follows. Section “Data” describes the satellite data processing and the stock returns data. Section “Forecasting results” reports the forecasting results. Section “Understanding the source of return predictability” explains the source of return predictability. Section “Discussion” discusses the application of satellite data. The last section concludes the paper.

Data

The identification of containers in ports

We collect publicly available and freely distributable satellite imagery from the Sentinel-2 mission. The dataset consists of 83,672 RGB images of 48 major ports from January 1, 2017, to November 1, 2021. Figure 1 shows the global distribution of those container ports. The details of satellite imagery processing and model training are given in the Appendix.

Fig. 1: Global distribution of container ports.

Location of the top 48 container ports ranked by throughput (TEU) data in 2020 obtained from the Institute for Shipping Economics and Logistics (ISL). TEU stands for twenty-foot equivalent unit. The darkness of each point represents its throughput (units: 10 million TEUs); ports with higher throughput are darker.

The identification of containers in the ports can be treated as a semantic binary segmentation task, which is an increasingly popular domain in computer vision. A semantic binary segmentation task takes an image as input and outputs a binary classification result for each pixel in the image. In this task, the two classes are “container” and “non-container.” Note that containers are usually stacked in several layers to save floor space. However, the number of layers cannot be recognized precisely from the Sentinel-2 satellite images because of their limited resolution. For simplicity, we assume that all container stacks have the same number of layers. Under this assumption, we count the number of pixels in each satellite image that are classified as “container” and take this count as the proxy for the number of containers in the port. Changes in the number of containers can reflect the dynamics of global economic activities.
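The pixel-count proxy described above can be illustrated with a minimal sketch. The mask format (a list of rows of 0/1 labels) and the names `container_pixel_count` and `toy_mask` are assumptions made for illustration; they are not taken from the paper's pipeline.

```python
def container_pixel_count(mask):
    """Count pixels labeled 1 ("container") in a binary segmentation mask.

    `mask` is a list of rows, each a list of 0/1 labels (assumed format).
    """
    return sum(label == 1 for row in mask for label in row)


# Toy 3x4 mask standing in for a segmentation output over a port image.
toy_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
toy_count = container_pixel_count(toy_mask)
```

Because every stack is assumed to have the same number of layers, the count is proportional to the container number only up to an unknown constant, which cancels once log changes are taken.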

Specifically, we use U-Net (Ronneberger et al., 2015), a conventional deep-learning model for semantic segmentation tasks, to identify containers from satellite images. As a variant of convolutional neural networks (CNN), U-Net uses a unique U-shaped architecture and skip connections to capture multi-scale contour information. U-Net has been shown to be efficient in cell segmentation tasks (Ronneberger et al., 2015). After multiple iterations and improvements (Dolz et al., 2018; Zhou et al., 2020), it shows excellent performance in medical imagery semantic segmentation tasks like CT pancreas segmentation (Oktay et al., 2018) and cancer detection (Huang et al., 2021). In recent years, U-Net has become a major image segmentation method in various research areas, especially satellite image segmentation. For example, researchers have used U-Net to locate photovoltaic solar energy-generating units from space (Kruitwagen et al., 2021) and to forecast seasonal arctic sea ice (Andersson et al., 2021).

We construct a unique training set for our U-Net model. Semantic segmentation of satellite or aerial imagery traditionally relies on precisely labeled standard datasets. For example, the Massachusetts roads data set provides images covering more than 2600 square kilometers and precisely labels the shape of every road in each image. However, because of the lack of research into container identification in satellite images, a standard dataset for container recognition is not yet available. Therefore, it is necessary for us to construct a dataset oriented toward our task. Specifically, we pick out all Sentinel-2 satellite images in 2017, the earliest year in our dataset. After discarding images with 5% or greater cloud coverage, we label the remaining 3711 images by hand. We identify the “container” and “non-container” areas of each image and use these labels as our dataset for U-Net model training. As we use data from 2017 to train the model, and then identify images and predict stock returns after that period, this procedure avoids forward-looking bias due to the application of future information.

We train different U-Net models to find the one that best fits our task by selecting two hyperparameters: the size of input images and the depth of the network. First, given the convolution kernel size (normally 3 × 3), the input size of images influences the perception ability of the convolution kernel. Second, the depth of the network, which is usually referred to as the number of convolution layers in U-Net, determines the level of contour information that can be used in the model. A deeper layer generally corresponds to a higher level of information. Table 1 shows the analysis results of those two hyperparameters. We find that the highest identifying accuracy is achieved when inputting medium-size images (480 × 480 pixels) to a deeper network structure (23 convolution layers). The preference for deep layers echoes the simplicity of the container stack shape because a deeper network leads to a better capacity for abstraction, which works best when the object contour is simple. Finally, our selected model for identification achieves 93.20% accuracy, 92.45% recall, and 92.81% F-score in the testing set, demonstrating good performance.

Table 1 Experiment results of the U-Net model.

We then measure the container coverage areas, which are our proxy for container numbers, from each image using our trained model. It is essential to evaluate whether our model performs consistently across images. Thus, we visualize and analyze how the global spatial distribution of containers changes over time for a better understanding of model performance in practice. A simple test is to see whether the identification results for the spatial distribution of containers in 2017 look similar to the true spatial distribution in that year. If so, we can conclude that the model performs consistently well. We do another test using the exogenous shock caused by COVID-19. Ports in the United States faced severe congestion due to COVID-19 (Meeks et al., 2021), as a lack of truck drivers and other laborers caused a huge number of containers to pile up. If our model performs well, it will capture a significant growth in the number of containers in 2021.

Figure 2 shows the identification results of our U-Net model for the Port of Los Angeles, the largest port in the United States. The first column introduces the Sentinel-2 satellite image, and the yellow areas indicate seven subregions of the container port. The subgraphs in the second column show the container distribution in the training set in 2017, in which the color of each pixel indicates the percentage of pixels classified as “container”; a brighter color represents a higher percentage. The third to last columns show the recognition results for our model from 2017 to 2021. The results of the two tests can be found in Fig. 2. First, the similarity of the second and third columns reveals the consistency of the training labels and recognition results. Second, the abnormal growth in the number of containers in 2021, as shown in the last column, is consistent with the port congestion in 2021 (Meeks et al., 2021). Both pieces of evidence demonstrate the stability of our model.

Fig. 2: Container recognition results for Port of Los Angeles.

The first column shows satellite images at 7 different parts of Port of Los Angeles (LA) and the yellow areas indicate the container yards. The second column is the ground truth of containers’ spatial distributions in 2017, labeled by hand. The brighter color of pixels means a higher container frequency in one specific year, and the annual average frequency is marked at the bottom of each diagram. Container recognition results by U-Net are shown from the third column to the last column, representing results from 2017 to 2021, respectively.

To predict the stock market return, we calculate the daily average change in the number of containers in the ports (GNC), which is defined as \({\rm {GNC}}_{i,t} = \frac{{\log \left( {{\rm {NC}}_{i,t}} \right) - \log \left( {{\rm {NC}}_{i,s}} \right)}}{{t - s}}\), where \({\rm {NC}}_{i,t}\) represents the number of containers in port i at time t, and s represents the most recent date before time t for which cloud-free (clear) satellite imagery is available. Dividing the log difference by \(t - s\) standardizes the change in the number of containers to a per-day rate, which eliminates the influence of the uneven distribution of observations over time. Intuitively, an increase in GNC indicates higher port congestion and lower trading volume, which heralds a future economic downturn and predicts lower stock returns.
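The GNC definition above can be illustrated with a short sketch for a single port. The input format (sorted pairs of day index and container count, with cloudy days simply absent) and the names `gnc_series` and `toy_obs` are assumptions for illustration only.

```python
import math


def gnc_series(observations):
    """Per-day log change in container counts between consecutive clear images.

    `observations` is a list of (day_index, container_count) pairs sorted by
    day; days without a cloud-free image are absent (assumed format).
    Returns a list of (day_index, gnc) pairs, one per image after the first.
    """
    result = []
    for (s, nc_s), (t, nc_t) in zip(observations, observations[1:]):
        # Log difference divided by the gap t - s between clear images.
        gnc = (math.log(nc_t) - math.log(nc_s)) / (t - s)
        result.append((t, gnc))
    return result


# Toy counts with gaps of 1, 3, and 2 days between clear images.
toy_obs = [(0, 1000), (1, 1100), (4, 1100), (6, 900)]
toy_gnc = gnc_series(toy_obs)
```

In this toy series the first value is positive (containers accumulate), the second is zero, and the third is negative, with the two-day gap halving the log change.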

Stock returns data

Our dataset of international stock returns covers 33 market-level indices from 28 countries, including 18 developed markets and 12 emerging markets. We collect daily stock returns from January 2017 through November 2021 from the Wind Database. Table 2 presents the details of the stock indices under consideration.

Table 2 International stock index.

We recursively generate daily stock return forecasts from January 2019 through November 2021, and therefore the data from January 2017 through December 2018 are used as the initial estimation sample. The selection of this initial estimation sample involves a trade-off. On the one hand, we require more initial data to get a more reliable regression estimate in the process of computing the first return forecast. On the other hand, we also require a longer out-of-sample evaluation period to obtain a more accurate evaluation result. As a compromise, we select 40% of the data as the initial estimation sample. In this way, our out-of-sample period covers January 2019 through November 2021, which spans periods both before and after the COVID-19 pandemic. To check the robustness of our results to the sample selection, we re-examine the forecasting performance using an alternative initial estimation sample from January 2017 to December 2017. The results are shown in the online appendix.

Stock indices trade in different time zones. Therefore, for each stock index, we convert the Sentinel-2 image timestamps from UTC to local time to match the data. We use the satellite imagery available within 24 h before closing time to predict the close-to-close return on the next business day, which avoids the introduction of forward-looking information.

Competing shipping indicators

We investigate the linkages between our container number index and existing shipping indicators, including the container throughput index and the Baltic Dry Index (BDI). Although these indexes are published with lags, they capture global economic activity well (Kilian, 2009). Figure 3 plots the global number of containers (GNC) index, the BDI, and the container throughput index (RWI_ISL). We can observe that the GNC index moves inversely to the competing shipping indicators.

Fig. 3: The global number of containers index.

GNC represents the global number of containers calculated by summing the numbers for all 48 ports. For comparison, we also plot the Baltic Dry Index (BDI) and the container throughput index collected by the Leibniz Institute for Economic Research (RWI) and the Institute for Shipping Economics and Logistics (ISL). The data are min–max normalized between 0 and 1.

Forecasting results

Forecasting procedure

Following the majority of papers on return predictability, we assume a linear relationship between the GNC and the stock index return (Goyal and Welch, 2003; Welch and Goyal, 2008). The specification of predictive regressions for stock returns on the lagged predictor variable of interest can be written as follows:

$$r_{t + h} = X_t \cdot \beta + \varepsilon _{t + h},\,t = 1,...,T - h$$
(1)

where \(r_{t + h}\) represents the average daily continuously compounded stock returns in excess of the risk-free rate from time t to t + h; \(X_t\) represents a vector that consists of a return predictor and an intercept; and \(\varepsilon _{t + h}\) is the error term, \(\varepsilon _t\sim {\rm {i.i.d.}}\left( {0,\sigma _\varepsilon ^2} \right)\). The parameter estimates of the predictive regression can be obtained via ordinary least squares (OLS). As investors do in practice, we execute one-step-ahead forecasting to generate out-of-sample return predictions. Specifically, the return forecasts from univariate predictive regression can be written as follows:

$$\widehat r_{t + h} = X_t\widehat \beta _t,t = M,\,...,\,T - h$$
(2)

where \(\hat \beta _t\) is the estimate of β using the information available until time t. At time t, we regress \(\left\{ {r_\tau } \right\}_{h + 1}^t\) on \(\left\{ {x_\tau } \right\}_1^{t - h}\) to obtain \(\hat \beta _t\) via OLS. The parameter estimates are updated at each point in time for t ≥ M using expanding windows, where M denotes the length of the initial estimation sample.
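The recursive scheme in Eqs. (1) and (2) can be sketched for a single predictor as follows. This is a minimal illustration under stated assumptions: indices are 0-based, `r[k]` is the target aligned with predictor value `x[k - h]`, and the function names `ols_fit` and `expanding_window_forecasts` are hypothetical rather than taken from the paper.

```python
def ols_fit(x, y):
    """Intercept and slope of a univariate least-squares regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def expanding_window_forecasts(x, r, m, h=1):
    """Forecasts of r[t + h] made at origins t = m - 1, ..., len(r) - 1 - h.

    At each origin t, only pairs (x[j], r[j + h]) with j + h <= t are used,
    so nothing observed after t enters the estimate.
    Returns a dict mapping the target index t + h to its forecast.
    """
    forecasts = {}
    for t in range(m - 1, len(r) - h):
        past_x = x[: t - h + 1]      # x[0], ..., x[t - h]
        past_r = r[h : t + 1]        # r[h], ..., r[t]
        intercept, slope = ols_fit(past_x, past_r)
        forecasts[t + h] = intercept + slope * x[t]
    return forecasts


# Toy data in which r[k] = 0.5 - 0.1 * x[k - 1] holds exactly.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
r = [0.0] + [0.5 - 0.1 * xi for xi in x[:-1]]
forecasts = expanding_window_forecasts(x, r, m=4, h=1)
```

Because the toy relationship is exactly linear, each forecast reproduces the realized value; with real returns the estimation window simply grows by one observation per origin.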

Next, we integrate information from multiple global ports by pooling the individual return forecasts. It has been shown that combining forecasts is an effective method for extracting information from high-dimensional predictors in economic forecasting (Timmermann, 2006). Forecast combinations use the weighted average of forecasts from individual models, as given by

$$\widehat r_{t + h,{\rm {comb}}} = \mathop {\sum}\nolimits_{i = 1}^N {\widehat \pi _{t + h,i}\widehat r_{t + h,i},\quad t = M,\, \ldots ,\,T - h}$$
(3)

where \(\hat r_{t + h,i}\) denotes the forecast from model i, \(\hat \pi _{t + h,i}\) is the ex-ante weight assigned to model i formed at time t, and N is the number of predictive models. In this paper, we consider an equal-weighted mean combination, which uses equal weights \(\hat \pi _{t + h,i} = 1/N\). Although this weighting scheme is simple, recent empirical and simulation studies have shown that it is not necessarily outperformed by more sophisticated combinations (Smith and Wallis, 2009; Claeskens et al., 2016). Note that this paper focuses on plain OLS forecasting techniques and a naïve combination strategy, in the interest of straightforwardly testing the predictive power of the new GNC indicator.
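The equal-weighted combination in Eq. (3) reduces to a cross-sectional mean of the individual forecasts at each date. The sketch below assumes one equally long list of forecasts per port; the name `combine_equal_weight` and the toy numbers are illustrative assumptions.

```python
def combine_equal_weight(forecasts_by_port):
    """Equal-weighted mean across ports for each forecast date.

    `forecasts_by_port` is a list of N equally long lists, one per port,
    so each port receives the weight 1/N.
    """
    n = len(forecasts_by_port)
    return [sum(values) / n for values in zip(*forecasts_by_port)]


# Hypothetical forecasts from three ports for two dates.
toy_forecasts = [
    [0.0010, 0.0020],
    [-0.0004, 0.0000],
    [0.0006, 0.0010],
]
combined = combine_equal_weight(toy_forecasts)
```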

Statistical predictability

We examine whether satellite-based container data are helpful in predicting global stock market returns out-of-sample. The out-of-sample forecast at time t + 1 is made using the data available up to time t. The out-of-sample \(R^2\) (\(R_{{\rm {OoS}}}^2\)) is used to evaluate the forecasting performance, defined as follows:

$$R_{{\rm {OoS}}}^2 = 1 - \frac{{\mathop {\sum}\nolimits_{t = M + h}^T {\left( {r_t - \hat r_t} \right)^2} }}{{\mathop {\sum}\nolimits_{t = M + h}^T {\left( {r_t - \bar r_t} \right)^2} }}$$
(4)

where \(\hat r_t\) is the return forecast, \(\bar r_t\) is the prevailing mean forecast, and \(r_t\) is the realized return at time t. Therefore, \(R_{{\rm {OoS}}}^2\) quantifies the reduction of the forecasting loss of the given model relative to the forecasting loss of the benchmark model. A positive \(R_{{\rm {OoS}}}^2\) implies that the given model outperforms the benchmark model. We use the common benchmark of the historical average, which is typically hard to beat (Welch and Goyal, 2008). The statistic developed by Clark and West (2007) is applied to test the statistical significance of return predictability.
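The \(R_{{\rm {OoS}}}^2\) computation in Eq. (4) can be sketched as follows. The sketch assumes a one-day horizon, so the prevailing mean for date t averages all returns before t; the function names and toy returns are illustrative assumptions.

```python
def prevailing_mean_forecasts(r, start):
    """Benchmark forecast for r[t], t >= start: the mean of r[0], ..., r[t - 1]."""
    return [sum(r[:t]) / t for t in range(start, len(r))]


def r2_oos(realized, model_forecasts, benchmark_forecasts):
    """One minus the ratio of model to benchmark sums of squared errors."""
    sse_model = sum((a - f) ** 2 for a, f in zip(realized, model_forecasts))
    sse_bench = sum((a - b) ** 2 for a, b in zip(realized, benchmark_forecasts))
    return 1.0 - sse_model / sse_bench


# Toy daily returns; the out-of-sample period starts at index 2.
toy_returns = [0.01, -0.01, 0.02, 0.00, 0.01]
start = 2
realized = toy_returns[start:]
benchmark = prevailing_mean_forecasts(toy_returns, start)

r2_perfect = r2_oos(realized, realized, benchmark)    # model with zero error
r2_same = r2_oos(realized, benchmark, benchmark)      # model equal to benchmark
```

A model that matches the benchmark scores exactly zero, and any model with a larger squared error than the historical mean scores below zero.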

Figure 4 plots \(R_{{\rm {OoS}}}^2\) values of the equal-weighted forecast combinations using univariate predictive regression with the change in container numbers in each of the 48 ports. The combination model that aggregates information from global ports dominates the no-predictability benchmark of the historical average at horizons of up to 5 days. Specifically, at the horizon of 1 day, we observe positive \(R_{{\rm {OoS}}}^2\) in all markets, and 27 of these values are statistically significant at the 10% level. The average daily \(R_{{\rm {OoS}}}^2\) of 33 markets can reach 0.0529%. This powerful predictive ability can also be seen for longer forecasting horizons. The \(R_{{\rm {OoS}}}^2\) values are positive in all cases, with average magnitudes of about 0.05%. The long-horizon predictability suggests that the information from satellite imagery cannot be absorbed into the price immediately but is digested slowly.

Fig. 4: Forecasting performance evaluated by out-of-sample R-square.

This figure plots the out-of-sample R-square for different stock market indexes. The forecast quality is evaluated by the out-of-sample \(R^2\) (\(R_{{\rm {OoS}}}^2\)), defined as the percent reduction of the mean squared prediction error of the given model relative to the benchmark model of the historical mean, \(R_{{\rm {OoS}}}^2 = 1 - \mathop {\sum}\nolimits_{t = M + h}^T {\left( {r_t - \hat r_t} \right)^2} /\mathop {\sum}\nolimits_{t = M + h}^T {\left( {r_t - \bar r_t} \right)^2}\), where \(\hat r_t\), \(\bar r_t\) and \(r_t\) are the return forecasts from the model of interest, the prevailing mean forecasts and the realized returns, respectively. The full sample period is from 2017-01-01 through 2021-11-01 and the out-of-sample period starts on 2019-01-01 (i.e., M = 500). We multiply the \(R_{{\rm {OoS}}}^2\) by 100 to denote percentage values. We measure the statistical significance relative to the prevailing mean model using the Clark and West (2007) test statistic. The cases significant at the 10% level are highlighted in red.

We carry out statistical inference by testing whether the given model forecasts yield significant improvements over the benchmark forecasts. The asymptotic statistics suffer from the problem of small-sample bias. In addition, they may have the wrong size even when small-sample bias is considered. Due to such considerations, we follow Mark (1995) in using a bootstrap-based Diebold and Mariano (1995) statistic to examine the significance of the forecasting improvement. Specifically, we execute statistical inference using a stationary bootstrap procedure from Politis and Romano (1994) under the null hypothesis that the equity premium is unpredictable. The Diebold and Mariano (1995) (DM) statistic for testing equal predictive ability between the given model and benchmark model is given by

$${{{\mathrm{DM}}}} = \frac{{\overline {f_t} }}{{\sqrt {\frac{{\hat S_{f_t}}}{{N_{f_t}}}} }}$$
(5)

where \(\overline {f_t} = \frac{1}{{N_{f_t}}}{\sum} {\widehat {f_t}}\), \(\hat S_{f_t} = \frac{1}{{N_{f_t}}}{\sum} {\left( {\widehat {f_t} - \overline {f_t} } \right)^2}\) and \(\widehat {f_t} = \left( {\bar r_t - r_t} \right)^2 \,- \,\left( {\hat r_t - r_t} \right)^2\). The number of resamples is set as 2000 when bootstrapping, and the block length is optimally estimated from the data using the selection procedure of Patton et al. (2009).
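The statistic in Eq. (5) can be computed from the loss differentials as sketched below. The sketch covers only the DM statistic itself; the stationary bootstrap and the block-length selection are not implemented here, and the function name and toy values are illustrative assumptions.

```python
import math


def dm_statistic(realized, model_forecasts, benchmark_forecasts):
    """DM statistic: mean loss differential over its estimated standard error.

    The loss differential is the benchmark squared error minus the model
    squared error, so positive values favor the model.
    """
    f = [
        (b - a) ** 2 - (m - a) ** 2
        for a, m, b in zip(realized, model_forecasts, benchmark_forecasts)
    ]
    n = len(f)
    f_bar = sum(f) / n
    s_hat = sum((fi - f_bar) ** 2 for fi in f) / n
    return f_bar / math.sqrt(s_hat / n)


# Toy returns with a model that is closer to the outcome than a zero benchmark.
toy_realized = [0.02, -0.01, 0.01, 0.03]
toy_model = [0.01, -0.01, 0.00, 0.02]
toy_bench = [0.0, 0.0, 0.0, 0.0]
dm = dm_statistic(toy_realized, toy_model, toy_bench)
```

Swapping the roles of model and benchmark flips the sign of the statistic; in the bootstrap approach described above, the same statistic would be recomputed on each resample to obtain its distribution under the null.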

As shown in Table 3, we find that the forecast improvement of the GNC model is statistically significant for most cases after accounting for size distortion. The bootstrap-based DM tests show a significantly positive \(R_{{\rm {OoS}}}^2\) at the 10% level for 29 of 33 markets at the horizon of one day. This finding holds for longer forecasting horizons. Overall, the forecast improvement of the GNC model is consistent across different time periods and is insensitive to sample size distortion.

Table 3 Forecasting improvement significance evaluated by bootstrap-based Diebold–Mariano (1995) statistic.

We further use an alternative evaluation, success ratio (SR), which measures how often the model generates forecasts with the correct sign. The criterion is given by

$${\rm {SR}} = \frac{1}{{T - M - h + 1}}\mathop {\sum}\nolimits_{t = M + h}^T {l_t,l_t = I\left( {\left( {r_t - \hat r_{t,{\rm {model}}}} \right)^2 \,<\, r_t^2} \right)}$$
(6)

The SR criterion is less sensitive to outliers than the \(R_{{\rm {OoS}}}^2\) metric. The PT test (Pesaran and Timmermann, 2009) is used to examine whether the success ratio of each model is significantly greater than 50%. Figure 5 exhibits the corresponding results. We find that the combined information from individual ports correctly predicts the sign of the change in the market index more frequently than tossing a coin in most cases. The success ratios are higher than 0.5 for 27 markets at the forecasting horizon of one day, and 23 of these values are significant. The models are also more successful than a random-walk benchmark in predicting the direction of change at longer horizons, although the directional predictability is slightly weakened.
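The success ratio in Eq. (6) can be sketched as follows. The indicator is implemented exactly as written in the equation, counting a success when the model's squared error is smaller than the squared realized return; the function name and toy values are illustrative assumptions.

```python
def success_ratio(realized, model_forecasts):
    """Share of dates on which (r_t - forecast)^2 < r_t^2, as in Eq. (6)."""
    hits = [
        1 if (a - f) ** 2 < a ** 2 else 0
        for a, f in zip(realized, model_forecasts)
    ]
    return sum(hits) / len(hits)


# Toy returns and forecasts; the second forecast has the wrong sign.
toy_realized = [0.02, -0.01, 0.01, -0.03]
toy_model = [0.01, 0.005, 0.002, -0.01]
sr = success_ratio(toy_realized, toy_model)
```

Under this indicator a forecast with the wrong sign can never count as a success, while a forecast with the right sign counts as long as its magnitude is less than twice that of the realized return.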

Fig. 5: Forecasting performance evaluated by directional accuracy.

The forecast quality is evaluated by the success ratio (SR), defined as how often the model generates forecasts with the correct direction, \({\rm {SR}} = 1/\left( {T - M - h + 1} \right)\mathop {\sum}\nolimits_{t = M + h}^T {l_t} ,\,l_t = I\left( {\left( {r_t - \hat r_{t,{\rm {model}}}} \right)^2 \,<\, r_t^2} \right)\), where \(\hat r_{t,{\rm {model}}}\) and \(r_t\) are the return forecasts from the model of interest and the realized returns, respectively. The full sample period is from 2017-01-01 through 2021-11-01 and the out-of-sample period starts on 2019-01-01 (i.e., M = 500). We measure the statistical significance using the Pesaran and Timmermann (2009) test statistic. The cases significant at the 10% level are highlighted in red.

Forecasting performance over time

We check the predictive power of container data over time. A concern is that the model cannot consistently beat the benchmark but only show predictive ability during short periods. To address this issue, we calculate the cumulative sum of squared prediction error difference (CSSED) proposed by Welch and Goyal (2008), defined as follows:

$${\rm {CSSED}}_{{\rm {model}},t} = \mathop {\sum}\nolimits_{\tau = M + h}^t {\left( {e_{{\rm {bench}},\tau }^2 - e_{{\rm {model}},\tau }^2} \right),\quad t = M + h, \ldots ,T}$$
(7)

where \(e_{{\rm {bench}},\tau }\) and \(e_{{\rm {model}},\tau }\) denote the time τ forecast errors associated with the historical average benchmark model and the given model, respectively. The CSSED measure has become a standard indicator in the finance literature for evaluating out-of-sample return predictability (Goyal et al., 2021). The CSSED curve can illustrate whether a model of interest produces more accurate return forecasts than the benchmark model over any given evaluation period by redrawing the horizontal zero line at the beginning of that period. Intuitively, when the forecast generated from the given model outperforms the forecast generated from the benchmark model at time t + 1, the CSSED increases from time t to t + 1.
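The CSSED path in Eq. (7) is a running sum of squared-error differences, as the following sketch shows. The function name `cssed` and the toy values are illustrative assumptions.

```python
def cssed(realized, model_forecasts, benchmark_forecasts):
    """Running sum of benchmark minus model squared forecast errors."""
    path = []
    running = 0.0
    for a, m, b in zip(realized, model_forecasts, benchmark_forecasts):
        running += (a - b) ** 2 - (a - m) ** 2
        path.append(running)
    return path


# Toy returns with a model that is never worse than a zero benchmark.
toy_realized = [0.02, -0.01, 0.01, 0.03]
toy_model = [0.01, -0.01, 0.00, 0.02]
toy_bench = [0.0, 0.0, 0.0, 0.0]
toy_path = cssed(toy_realized, toy_model, toy_bench)
```

The path rises on dates where the model's squared error is smaller than the benchmark's and stays flat where the two are equal, which is why a predominantly positive slope indicates consistent outperformance.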

Figure 6 plots the CSSEDs of the combined GNC models relative to the historical average benchmark. At the horizon of 1 day, all the curves have slopes that are predominantly positive, suggesting that the forecasts conditional on the number of containers consistently outperform the prevailing mean benchmark over time. At longer horizons, the CSSED curves are also positively sloped and have less frequent falloffs. The smoother line illustrates the higher robustness of the predictive power of the container information in forecasting long-term returns. More importantly, we find an interesting pattern in all markets, in that the CSSEDs jumped upward during the global depression caused by COVID-19, especially in the American and European markets. The evidence indicates much stronger return predictability during the COVID-19 recession.

Fig. 6: Cumulative sum of squared prediction error differences relative to the historical average forecasts.

This figure plots the cumulative sum of squared prediction error differences (CSSED) over time for the model of interest relative to the historical average benchmark model. The CSSED is computed as \({\rm {CSSED}}_{{\rm {model}},t} = \mathop {\sum}\nolimits_{\tau = M + h}^t {\left( {e_{{\rm {bench}},\tau }^2 - e_{{\rm {model}},\tau }^2} \right)}\), where \(e_{{\rm {bench}},\tau }\) and \(e_{{\rm {model}},\tau }\) denote the time-τ forecast errors of the historical average benchmark model and the given model, respectively.

The better performance of the container indicators during the COVID-19 period can be linked to the supply disruptions caused by friction in the container shipping market. The COVID-19 pandemic produced a historic global crisis. To prevent an outbreak, many countries went into lockdown and imposed unprecedented containment measures. These led to unexpected frictions in container shipping markets, such as port congestion, shipping delays, or shortages along the supply chain, which seriously reduced economic activity. These frictions increased suddenly in early 2020, as the pandemic spread across China and then the rest of the world; lockdowns disrupted global supply chains and reduced the volume of container shipping.

Figure 3 exhibits a sharply increasing number of containers stacked in port, implying blocked shipping capacity. Unexpected shocks along supply chains prevent demand from being realized, particularly for durable goods (Notteboom et al., 2021). For example, the U.S. domestic demand component of overall goods consumption dropped approximately 20% in early 2020, and durable goods consumption dropped more than 30% (Kilian et al., 2021). Kilian et al. (2021) show that the primary determinant of the economic contraction in developed countries in early 2020 was the sharp drop in domestic demand. In contrast to the Great Recession, the change in consumption played a more important role in determining real economic activity during the COVID-19 crisis period. Using data on container trade volume in North America, Kilian et al. (2021) provide quantitative evidence of the impact of friction in the container shipping market on the real economy during the COVID-19 crisis. They find that these frictions led to an abrupt drop in container trade, triggering an economic recession in the U.S. in March 2020. Furthermore, they find that the recovery of North American container trade in late 2020 was substantially driven by lower frictions in the container shipping market rather than by a recovery of U.S. import and export demand. Figure 6 shows a pattern in which the forecasting performance of the satellite-based container indicator improves during the COVID-19 pandemic. The CSSED curves jump for most stock market indices in March 2020, echoing the effects of shocks to container market frictions on economic activities.

We find that the CSSED jump is more prominent in the U.S. and European markets. This finding can be explained by the heterogeneous effects of lockdown policies on different countries. Starting in mid-January 2020, China implemented an unprecedented series of national lockdown policies to contain the spread of the virus. Because of the interconnected global shipping network, the local shock in China propagated to other regions and became a global shock. Bai et al. (2022) use data derived from the Automatic Identification System to quantify the impact of pandemic lockdown policies on global port calls. They find that in February 2020, container shipping at ports with high levels of connectivity to Chinese ports was significantly affected by Chinese lockdown policies, with a time lag of two to three weeks, depending on the voyage duration of a container ship. The effects of China’s lockdown on close neighbors were less serious than the effects on highly connected ports in distant regions. As a result, in mid-March 2020, significant supply chain breaks occurred in European and American ports but not in Asian ports. The unexpected frictions in container shipping markets caused a sharp drop in domestic demand in these countries, which explains the jump in return predictability revealed by the satellite-based container indicator.

Economic significance of return predictability

While the container number data show statistically significant return predictability, investors care more about whether this predictability generates economic gains. Following the literature (Welch and Goyal, 2008; Rapach et al., 2010), we consider a risk-neutral investor who executes zero-investment strategies that either invest in a risky asset (the stock market index) financed with bills or invest in a risk-free asset (bills) financed by shorting the stock index. A popular benchmark is the unconditional buy-and-hold strategy, in which the investor holds the stock index at all times and earns only the equity premium. Although the buy-and-hold approach is naïve and involves no timing, it is rarely dominated by sophisticated timing strategies conditional on various indicators, as documented in previous papers (Goyal et al., 2021).

In contrast, an investor with information on the number of containers can try to time the market. To make the performance of our portfolio strategy more realistic, we start trading with a one-day lag, once the satellite information is available to investors. Here, we test whether the predictability from the number of containers is economically meaningful, that is, whether it can generate higher returns than the naïve buy-and-hold strategy. Specifically, we consider two timing strategies, tilted and untilted, based on the return forecasts. The untilted strategy invests in the market index using money financed by bills when the market is expected to be bullish, but shorts the market index and saves the proceeds in bills when it is expected to be bearish. This strategy judges market trends by whether the return forecast is above or below the historical average. Unlike the untilted strategy, the tilted strategy switches from long to short stocks only if the signal is very bearish, that is, below the 25th percentile rather than the mean. For both timing strategies, the weight assigned to the risky asset is scaled according to a Z-score specification: the prevailing mean (untilted) or first quartile (tilted) of the forecasts is subtracted from the forecast at a given time, and the difference is divided by the prevailing standard deviation. These two strategies earn the same as the unconditional strategy when the forecast is bullish, but the opposite when the forecast is bearish.
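
The Z-score scaling described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: it assumes that the prevailing statistics use all forecasts up to and including the current date and that the first quartile is found by linear interpolation, neither of which is specified in the text. The names `timing_weights` and `first_quartile` are hypothetical.

```python
import statistics


def first_quartile(values):
    """25th percentile, linearly interpolated between order statistics."""
    ordered = sorted(values)
    pos = 0.25 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def timing_weights(forecasts, tilted=False, min_obs=2):
    """Z-score weight on the stock index for each forecast date.

    Returns None where fewer than `min_obs` forecasts exist or where the
    prevailing standard deviation is zero.
    """
    weights = []
    for t in range(len(forecasts)):
        history = forecasts[: t + 1]  # forecasts known at date t (assumption)
        if len(history) < min_obs:
            weights.append(None)
            continue
        spread = statistics.stdev(history)
        if spread == 0:
            weights.append(None)
            continue
        # Untilted: centre on the prevailing mean; tilted: on the first quartile.
        center = first_quartile(history) if tilted else statistics.mean(history)
        weights.append((forecasts[t] - center) / spread)
    return weights


# Toy return forecasts (made up for illustration).
toy_forecasts = [0.0, 1.0, 2.0, 3.0, 4.0]
untilted = timing_weights(toy_forecasts)
tilted = timing_weights(toy_forecasts, tilted=True)
```

Under this reading, a positive weight is a long position in the index financed by bills and a negative weight is a short position. The tilted weights turn negative only when the forecast falls below the prevailing first quartile.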

Table 4 reports the mean return and Sharpe ratio (SR) for the three investment strategies: buy-and-hold, tilted, and untilted. We find that both timing strategies, which are conditional on return forecasts based on the number of containers, can beat the buy-and-hold benchmark. The untilted timing strategy outperforms the naïve strategy in 26 of 33 markets, with a mean return of about 14.85% per year and an annualized SR of 1.16, almost double the mean and SR of the benchmark strategy. The tilted timing strategy performs even better and outperforms the naïve strategy in 30 of 33 markets, with an annualized return of 16.38% and an SR of 1.19. Overall, the return predictability revealed by the number of containers is economically meaningful and profitable.
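
Mean returns and Sharpe ratios of the kind reported in Table 4 could be computed along the following lines. The sketch assumes daily data, 252 trading days per year, and a strategy return equal to the previous day's weight times the index excess return. The text does not state the annualization convention, so these choices and the function names are assumptions.

```python
import statistics

TRADING_DAYS = 252  # assumed annualization factor; not stated in the text


def strategy_returns(weights, excess_returns, lag=1):
    """Zero-investment return: the weight set `lag` days earlier times
    the index excess return realized today."""
    return [
        weights[t - lag] * excess_returns[t]
        for t in range(lag, len(excess_returns))
    ]


def annualized_mean_and_sharpe(daily_returns):
    """Annualized mean return and Sharpe ratio from daily excess returns."""
    mean = statistics.mean(daily_returns)
    vol = statistics.stdev(daily_returns)
    return mean * TRADING_DAYS, (mean / vol) * TRADING_DAYS ** 0.5


# Toy inputs (made up): +1 is long the index, -1 is short the index.
toy_weights = [1, 1, -1, 1]
toy_excess = [0.01, 0.02, 0.01, -0.02, 0.01]
timed = strategy_returns(toy_weights, toy_excess)
ann_mean, ann_sharpe = annualized_mean_and_sharpe(timed)
```

Passing a weight series of all ones gives the buy-and-hold benchmark under the same conventions, so the two strategies can be compared like for like.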

Table 4 Portfolio performance of the number of containers.

Understanding the source of return predictability

A comparison with existing shipping data

The satellite-based container number data can predict returns because they are forward-looking with regard to economic activity. A natural question is whether these data are more accurate than other types of shipping data used to measure economic activity, such as the freight rates indicator (Kilian, 2009) and container throughput (Döhrn and Maatsch, 2012; Döhrn, 2019; Kilian et al., 2021). If the answer is no, our container data are meaningless.

Intuitively, on the premise that the total amount of containers in economic production is time-invariant in the short term, a larger number of idle containers stacked in the yard means fewer containers in the maritime logistics supply chain. Thus, an increase in the number of containers in the port implies reduced demand for shipping services. We first test this hypothesis by checking the predictive relationship between the number of containers in the port and the container throughput. The specification of the model is given by

$$\Delta {\rm {RWI}}\_{\rm {ISL}}_{t + h} = \alpha + \beta _1\Delta {\rm {GNC}}_t + \beta _2\Delta {\rm {RWI}}\_{\rm {ISL}}_t + \varepsilon _{t + h}$$
(8)

The dependent variable \(\Delta {\rm {RWI}}\_{\rm {ISL}}_{t + h}\) is the h-month change in container throughput, measured by the RWI/ISL container throughput index. This index is a global trade indicator compiled by the Leibniz Institute for Economic Research (RWI) and the Institute for Shipping Economics and Logistics (ISL); it includes information from the 82 biggest ports in the world, which handle around 60% of global container throughput. The RWI_ISL index is highly correlated with global trade measures in processed goods and is a leading indicator of industrial production (Döhrn and Maatsch, 2012; Döhrn, 2019). \({\rm {GNC}}_t\) is the aggregate number of containers over all 48 ports under consideration. The explanatory variables are the growth rate of the number of containers and the growth rate of container throughput, both observed at time t.

We test the link between the number of containers and freight rates using the following equation:

$$\Delta {\rm {BDI}}_{t + h} = \alpha + \beta _1\Delta {\rm {GNC}}_t + \beta _2\Delta {\rm {BDI}}_t + \varepsilon _{t + h}$$
(9)

The dependent variable is the change in the Baltic Dry Index (BDI), which is a good proxy of global real economic activity in industrial commodity markets. An increase in economic activity boosts demand for shipping services, causing freight rates to rise (Kilian, 2009). As in Eq. (8), the explanatory variables are the lagged growth rate in container numbers and the lagged dependent variable.

Table 5 reports the regression results at horizons of 1–4 months. The significance of the coefficients is assessed with t-statistics based on heteroscedasticity-robust standard errors. The explanatory variable \(\Delta {\rm {GNC}}_t\) is standardized. Its slope coefficient is significantly negative in both predictive regressions at the horizon of 2 months: a one-standard-deviation increase in \(\Delta {\rm {GNC}}_t\) is associated with a 27.2% decrease in container throughput changes and a 26.1% decrease in Baltic Dry Index changes. This indicates that satellite-based information leads the traditional shipping indicators.
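
For readers who want to reproduce regressions of the form of Eqs. (8) and (9), the following NumPy sketch estimates the slope on a standardized predictor and reports White (HC0) t-statistics. The text only says the standard errors are heteroscedasticity-robust, so the HC0 variant is an assumption, as are the function name and the simulated data.

```python
import numpy as np


def predictive_regression(dy, dgnc, h):
    """OLS of dy[t+h] on a constant, standardized dgnc[t], and dy[t].

    Requires h >= 1. Returns (coefficients, HC0 robust t-statistics).
    """
    dy = np.asarray(dy, dtype=float)
    dgnc = np.asarray(dgnc, dtype=float)
    z = (dgnc - dgnc.mean()) / dgnc.std(ddof=1)  # standardize the predictor
    target = dy[h:]                              # dependent variable at t + h
    X = np.column_stack([np.ones(len(target)), z[:-h], dy[:-h]])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = bread @ meat @ bread                   # HC0 sandwich estimator
    return beta, beta / np.sqrt(np.diag(cov))


# Simulated data (not the paper's): container growth lowers the
# shipping indicator two periods later.
rng = np.random.default_rng(0)
sim_dgnc = rng.normal(size=300)
sim_dy = rng.normal(scale=0.5, size=300)
sim_dy[2:] += -0.5 * sim_dgnc[:-2]
coefs, tstats = predictive_regression(sim_dy, sim_dgnc, h=2)
```

Here `coefs[1]` plays the role of the slope on the container variable in Eq. (8) or (9). Because the predictor is standardized, it reads as the response to a one-standard-deviation change.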

Table 5 Forward-looking content of satellite imagery data.

Predictive ability for real economic activity

The motivation for using port imagery to predict stock market returns comes from the close link between container shipping trade and the real economy. Maritime transport is the backbone of the global economy (McConville, 1999; Kilian, 2009; Branch and Stopford, 2013). Because 90% of the world’s non-bulk dry cargo is shipped in containers, container trade volume is considered an appropriate measure of economic activity. Several studies have shown that container trade volume is a leading indicator of global or regional industrial production (Döhrn and Maatsch, 2012; Döhrn, 2019; Michail, 2020; Kilian et al., 2021). According to the dividend discount model, asset returns are driven by time-varying discount rates, which are functions of the state variables of the real economy (Rapach and Zhou, 2013). Therefore, economic variables that measure or predict the state of the economy are useful for predicting returns (Fama and French, 1989; Campbell and Cochrane, 1999; Cochrane, 2007, 2011). Thus, we expect that container shipping trade volume can forecast stock returns if it measures economic activity well. However, shipping trade data are not available in real-time and are always released with a one- or two-month delay (Döhrn and Maatsch, 2012). Fortunately, the satellite-based technique provides an innovative information acquisition channel, through which the information can be incorporated into prices before the government’s announcement (Mukherjee et al., 2021).

We construct a real-time container shipping trade indicator based on port satellite imagery. In general, an increase in the number of containers at a port is a sign of port inefficiency, which is typically caused by congestion in transportation over urban roads, railways, and waterways (Talley, 2006). Such frictions in the container shipping markets restrict the container shipping trade. Although the literature reveals a relationship between container shipping volume and domestic demand, few studies directly link the number of containers in the yard to consumption. In contrast, the number of containers measures idle transport capacity, which is negatively related to container shipping volume. For this reason, we test whether the number of containers can anticipate real output growth. The motivation for our test is that the global marine trade links the production activity and consumption of goods and is a key node for monitoring economic activity. Therefore, the predictive ability of the number of containers may come from its prognostication of the real economy.

We create our macroeconomic growth forecasts using the following autoregressive distributed lag model:

$$\Delta {\rm {RIP}}_{t + h} = \alpha + \beta _1\Delta {\rm {GNC}}_t + \beta _2\Delta {\rm {RIP}}_t + \varepsilon _{t + h}$$
(10)

where \(\Delta {\rm {RIP}}_{t + h}\) is the growth rate of industrial production during period t + h. The right-hand-side predictors are the growth rate in the number of containers in the ports (\(\Delta {\rm {GNC}}_t\)) and the lagged dependent variable (\(\Delta {\rm {RIP}}_t\)), which is included to accommodate autocorrelation.

Table 6 reports the regression results for 28 countries at horizons of 1 to 6 months. The last row of the table shows the results for the average growth rate of industrial production across these countries. The number of containers negatively predicts the growth of industrial production in 27 of 28 countries at the horizon of 4 months. In particular, 15 of the estimated coefficients are statistically significant at the 10% level. Consistent results are obtained when regressing the average growth rate of industrial production on the change in container numbers. Overall, we conclude that the link between the number of containers in ports and global real economic activity plays a major role in the predictive ability of container numbers.

Table 6 Forecasting real industrial production growth with the number of containers.

To shed light on the underlying economic mechanism, we now focus on the predictive ability of the satellite-based container indicator around the outbreak of COVID-19. Table 5 shows that our satellite-based container indicator negatively predicts shipping indicators such as the freight rate indicator of Kilian (2009) and the container throughput indicator of Döhrn and Maatsch (2012). We investigate whether this negative predictive relationship exhibits a structural break around the COVID-19 crisis. Specifically, we use a predictive regression with an interaction term,

$$\begin{array}{l}y_{t + h} = \alpha + \beta _1\Delta {\rm {GNC}}_t + \beta _2\Delta {\rm {GNC}}_t \times {\rm {COVID}}_t + \beta _3{\rm {COVID}}_t\\ \qquad\quad+\, \beta _4y_t + \varepsilon _{t + h}\end{array}$$
(11)

where \(y_{t + h}\) represents one of the two competing shipping indicators under consideration, the freight rate or container throughput. \(\Delta {\rm {GNC}}_t\) is the change in the aggregate number of containers over all 48 ports. \({\rm {COVID}}_t\) is a dummy variable that takes the value of one after the outbreak of COVID-19, defined as the period after December 2019. The coefficients of interest are \(\beta _1\) and \(\beta _2\). As shown in Table 7, the estimates of \(\beta _1\) are significantly negative at the horizon of two months, indicating that GNC predicts the traditional shipping indicators. More importantly, the estimate of \(\beta _2\) is not significantly different from zero. This finding suggests that the predictive relationship is not affected by the shocks of COVID-19.
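
To make the role of the interaction term in Eq. (11) concrete: the slope on the container variable is \(\beta _1\) before the outbreak and \(\beta _1 + \beta _2\) afterwards. The sketch below builds that design matrix with NumPy and fits it by OLS on simulated data. The function name, the "YYYY-MM" date strings, and the simulated coefficients are assumptions for illustration only, and no standard errors are computed.

```python
import numpy as np


def covid_interaction_fit(y, dgnc, months, h):
    """OLS estimates of Eq. (11): [alpha, beta1, beta2, beta3, beta4].

    `months` holds "YYYY-MM" strings aligned with `y` and `dgnc`; h >= 1.
    """
    y = np.asarray(y, dtype=float)
    dgnc = np.asarray(dgnc, dtype=float)
    # COVID_t = 1 for months after December 2019 (ISO strings sort by date).
    covid = np.array([1.0 if m > "2019-12" else 0.0 for m in months])
    X = np.column_stack([
        np.ones(len(y) - h),       # intercept
        dgnc[:-h],                 # beta1: baseline slope
        dgnc[:-h] * covid[:-h],    # beta2: change in slope after the outbreak
        covid[:-h],                # beta3: level shift
        y[:-h],                    # beta4: lagged dependent variable
    ])
    beta, *_ = np.linalg.lstsq(X, y[h:], rcond=None)
    return beta


# Simulated monthly data, 2015-01 to 2024-12 (not the paper's data):
# the slope is -0.3 before 2020 and -0.9 afterwards.
sim_months = [f"{2015 + i // 12}-{i % 12 + 1:02d}" for i in range(120)]
rng = np.random.default_rng(1)
sim_gnc = rng.normal(size=120)
sim_covid = np.array([m > "2019-12" for m in sim_months], dtype=float)
sim_y = rng.normal(scale=0.2, size=120)
sim_y[2:] += (-0.3 - 0.6 * sim_covid[:-2]) * sim_gnc[:-2]
est = covid_interaction_fit(sim_y, sim_gnc, sim_months, h=2)
```

In this simulation `est[2]` is negative by construction. An estimate of \(\beta _2\) close to zero, as the text reports for the shipping indicators, would instead mean that the pre- and post-outbreak slopes coincide.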

Table 7 Forward-looking content of satellite imagery data during COVID-19 pandemic.

Next, we discuss the relationship between the satellite-based container indicator and real economic activity over the shocks of the COVID-19 period. Specifically, we use the following regression:

$$\begin{array}{l}\Delta {\rm {IP}}_{t + h} = \alpha + \beta _1\Delta {\rm {GNC}}_t + \beta _2\Delta {\rm {GNC}}_t \times {\rm {COVID}}_t + \beta _3{\rm {COVID}}_t\\ \qquad\qquad+\, \beta _4\Delta {\rm {IP}}_t + \varepsilon _{t + h}\end{array}$$
(12)

Panel C of Table 7 reports the corresponding estimation results. We find that the \(\beta _1\) estimate is significantly negative at the horizon of four months, implying that GNC leads real economic activity by four months. The estimate of \(\beta _2\) is also significantly negative, consistent with the enhanced predictive power of our indicator during the COVID-19 pandemic period. In summary, GNC is a leading measure of shipping trade, and its ability to measure shipping trade is not affected by the shock of the COVID-19 crisis. The simultaneous strengthening of the container information’s ability to predict the stock market and global industrial output during the COVID-19 pandemic further suggests that its link to the real economy is an important source of its return predictability.

Discussion

In this paper, we investigate the ability of satellite imagery data of container ports to predict global stock market returns. The number of containers, extracted using the deep learning method, reveals significant return predictability and delivers sizeable investment gains.

Market efficiency under costly information acquisition

The efficient market hypothesis states that stock prices reflect all available information (Fama, 1970). In an environment with perfect information, the market value of information approaches zero (Copeland and Friedman, 1992; Sunder, 1992) and there is no motivation for information-gathering. However, this hypothesis relies on the assumption that information acquisition is costless. Our finding of the return predictability of satellite information is consistent with the argument for market efficiency under costly information acquisition (Grossman and Stiglitz, 1980; Verrecchia, 1982). According to this argument, an investor who spends resources on collecting information expects to receive compensation in a market where information acquisition is costly. The information acquisition leads to informed trades and improves the information content in prices, further promoting price efficiency. When informed investors observe positive information about a security, they bid the current price up, and the future return will consequently be lower. In this way, the private information is quickly incorporated into stock prices. This process provides informed traders with profits and accordingly generates return predictability.

The application of satellite data increases price informativeness. In comparison with traditional data sources such as government or corporate announcements, satellite data have the advantages of real-time availability, frequency, and quality. As a result, this alternative data source provides forward-looking information about future earnings that can be incorporated into stock prices (Katona et al., 2018; Zhu, 2019; Mukherjee et al., 2021). The information about future earnings contained in satellite imagery effectively reduces insider trading. As a result, the availability of such alternative data decreases information asymmetry between firm insiders and outside investors. Managers have less opportunity to trade profitably on their private information about future earnings because a fraction of such information is available to outside investors through an alternative source. Therefore, we believe that the use of satellite data in investment decisions will improve market price efficiency.

The horizon of return predictability

We find return predictability at horizons of up to 5 days using satellite imagery data. One may wonder why satellite information is not instantly incorporated into the price—for example, on the same day that the information is released. Instead, the existence of return predictability implies a lead–lag effect.

We provide two explanations. First, the pre-processing and delivery of satellite data take time. Satellite data are usually sold as different levels of products instead of the raw data. Because the raw data only contain the original sensor signals, they cannot provide any useful insights to investors. They must be subjected to a series of pre-processing steps. For example, the transformation of the Sentinel-2 raw data into human-understandable images includes decompression, radiometric corrections, geometric viewing model refinement, resampling, and conversion to reflectance. Normally, it takes nearly 3 h to publish an early version of satellite data products and 48–60 h for carefully calibrated versions. This delay is inevitable for any satellite platform. As most raw satellite images are captured from 10:30 a.m. to 1:30 p.m. local time, the extracted satellite information usually arrives in the stock market with a one-day or longer delay.

Second, processing satellite images requires techniques that unsophisticated individual investors lack. The high threshold for processing satellite data naturally prevents most investors from using them, making their price effect gradual instead of instant. Jensen (1978) argues that when new information enters the market it is instantly integrated into stock prices if enough investors take it into account, leading to extremely short-term return predictability. However, only a portion of investors take satellite data into account. Most satellite products and customized services are so expensive that only institutional traders can afford them. Although some publicly available satellite data resources are accessible to individual investors, most of these investors lack the image-processing skills to interpret the information in a timely fashion. Thus, prices partially reflect the information of informed individuals (arbitrageurs) (Grossman and Stiglitz, 1980). This private information diffuses through the market as it is acquired by investors. The average private signal can be revealed by a sequence of prices, and uninformed investors condition their trades on public observables such as prices and order flows. When less-informed investors observe private signals, they may rationally behave like price chasers (Grundy and McNichols, 1989; Wang, 1993). As a result, the private satellite information is gradually integrated into the price. By Jensen’s (1978) argument, it therefore takes longer for satellite information to be incorporated into the market price.

Therefore, it is reasonable that our satellite-based container number index shows market return predictability at horizons of up to 5 days. The predictability across several days in fact reflects the further revelation of existing private information as time goes by.

Conclusions

Due to the development of big data technology, satellite imagery data are increasingly appreciated by sophisticated investors. In comparison with traditional economic data such as industrial production and consumer price indices, satellite data have greater frequency and are available in real-time. This paper provides a fresh example of the application of satellite data in financial investment decision-making.

As changes in the number of containers reflect real economic activities, we extract container information from satellite imagery of major global ports using a deep learning method. Our results suggest that the combined container information predicts stock returns in global mainstream markets. The return predictability is both statistically and economically significant. The predictive power of container information became stronger after the outbreak of the COVID-19 pandemic, especially in the U.S. and European markets. We also show that satellite-based container information leads the traditional shipping indicators of freight rates and container throughput. The container information is also found to be closely linked to global economic activity, serving as a potential explanation for the return predictability.

Container number data are not available in real-time. Container coverage drawn from satellite images provides a good proxy and can reflect economic changes, thus helping investors in the stock market to obtain excess returns. We execute an investment experiment and find that an investor making use of satellite imagery of seaports will on average receive annualized returns of 16%. In practice, several factors may reduce the investment gains from satellite information. For example, it typically takes several hours or longer to transform the original sensor signals into images and to process the satellite images. We have accounted for such lags in the portfolio analysis. Nevertheless, during the period of data processing, container information may flow into the stock market via other channels such as online text, although the availability of these alternative data remains another problem. We believe that with the development of big data technology, satellite images provide an important information source to help improve investment decisions.