Introduction

For many decades, gravity models have been applied successfully in many different contexts to analyse socio-economic flows of various types. Well-known examples include migration1,2,3, consumer spatial behaviour4, inter-city telephone communication flows5, hospital-patient flow systems6, and international trade7,8,9,10,11,12.

All these models predict or describe certain behaviours that mimic gravitational interaction, as described in Isaac Newton’s law of gravity. They assume that a flow between two places is directly proportional to their importance (expressed in, e.g., population size, gross domestic product (GDP), or some attractiveness index) and is inversely proportional to the physical distance between them. Thus, the simplest form of the gravity equation, written, for example, for the bilateral trade volume, is given by

$${f}_{ij}=G\frac{{x}_{i}{x}_{j}}{{r}_{ij}^{\alpha }}$$
(1)

where f ij is the trade volume between country i and country j; x i x j is the product of their GDPs; r ij is the geographic distance between them; G is a constant; and α is the distance coefficient. Gravity models (GM) work particularly well in systems where all the places are directly connected (i.e., where the underlying structure is a complete graph). The international trade network is a typical example of such a system. The value f ij of products or services exported from country i to country j does not affect (at least not directly) the other flows in the network.
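As an illustration of Eq. (1), the following minimal sketch evaluates the gravity flows for a toy system. The function name `gravity_flows`, the three-node example values, and the convention of setting the diagonal (i = j) to zero are assumptions made here for illustration only.

```python
import numpy as np

def gravity_flows(x, r, G=1.0, alpha=1.0):
    """Expected flows f_ij = G * x_i * x_j / r_ij**alpha, as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    f = np.zeros_like(r)
    # Assumption: self-flows (i == j, where r_ii = 0) are left at zero.
    off = ~np.eye(len(x), dtype=bool)
    f[off] = G * np.outer(x, x)[off] / r[off] ** alpha
    return f

# Hypothetical toy data: three places with sizes x and pairwise distances r.
x = [2.0, 3.0, 5.0]
r = [[0, 10, 20], [10, 0, 40], [20, 40, 0]]
f = gravity_flows(x, r, G=1.0, alpha=1.0)
```

For a symmetric distance matrix the resulting flow matrix is symmetric as well, which is one reason why directed data need additional ingredients beyond Eq. (1).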

Unlike in the above example, most transport networks involve a series of intermediate stops, which are, themselves, generators of originating and terminating traffic (see e.g. Chapter 7 in ref. 13). In such networks, especially for large distances, no direct connection may be present from location i to location j. In these cases, the potential flow, \({f}_{ij}^{(g)}\), which might be described by Eq. (1), is realized by an increase in the subsequent flows \({f}_{i{b}_{1}},{f}_{{b}_{1}{b}_{2}},\,\mathrm{...,}\,{f}_{{b}_{n-1}{b}_{n}},{f}_{{b}_{n}j}\). This scenario leads to an observed flow that differs from the expected one:

$${f}_{ij}\ne {f}_{ij}^{(g)}\mathrm{.}$$
(2)

This means that, in the case of airline networks, the standard gravity model cannot be used directly to estimate the weights of the existing flight connections.

Contrary to appearances, the divergence of the gravity model from actual data may prove useful for obtaining deeper insight into the details of the traffic patterns in transportation networks. In this paper, we demonstrate how one can exploit these discrepancies to discover statistical paths i − b 1 − … − b n  − j underlying the observed flows, f ij , in the network.

Usually, traffic data are collected in two ways. First, the data are obtained by counting objects (e.g., people, vehicles, or information packets) that pass any available link in the network. Such counting provides information about local traffic intensity, but it says nothing about where the objects started their travel or where they plan to finish. Second, the data are obtained by gathering information about the origin and destination of each object (e.g., from survey data or from travel tickets) without knowledge of the detailed path each object follows.

For this study, we had at our disposal data of the first type relating to international flights. We have checked that, regardless of the choice of x i (GDP, population size, etc.) in the standard gravity model, the flows f ij are not correctly described by Eq. (1). Careful data analysis shows that the observed inconsistency is due to transfer flights, which allow passengers to travel from (or to) less developed regions even though the network is sparse. The so-called ‘transfer passengers’ help to reduce flight costs and increase the frequency of flights, which is especially profitable for large airports. They also have a positive impact on the development of small airports. Thus, understanding how people choose between different intermediate airports has great practical potential. In this paper, we make a small contribution toward this goal.

We propose a simple model of connecting flights, which is confirmed by real data. The main assumption of the model is that the potential flow between two countries, \({f}_{ij}^{(g)}\), which includes all the passengers who start their journey in country i and end it in country j, regardless of transfer flights, is given by the gravity law, Eq. (1), with x i x j referring to the product of GDPs (the case of population size is discussed in the Supplementary Material). Although this assumption cannot be verified directly, it is well supported by the common observation that the gravity relationship arises from almost any microscopic economic model that includes costs that increase with distance7. This last condition certainly holds in most types of transportation networks.

The final subject of this paper is the discussion of the distance coefficient α in Eq. (1). Its behaviour over time is strictly related to the globalization process, which can be conceptualized as a continuous reduction of the effective distance in the world. Unexpectedly, most studies about gravity models in econometrics clearly show that, since the distance coefficient increases in time, the role of the distance grows simultaneously14,15,16,17. This counter-intuitive result is currently known as the missing globalization puzzle. Here, by recovering the gravity relationship in the flight network, we are able to analyse the time dependence of the distance coefficient in a typical transportation network.

The outline of the paper is as follows. First, we provide a version of the gravity model adapted to the flight network. Then, we introduce the model of connecting flights. Finally, we present the obtained results and discuss the behaviour of the distance coefficient. The data used in this study are described in the Methods section.

Results

Simple gravity model

Before we can verify if the gravity model can reproduce the weights of flight connections, we need to determine the value of the constant G in Eq. (1). To do this, one has to keep in mind that, in Eq. (1), in addition to G, there is another free parameter, namely the distance coefficient α. This coefficient is usually found from the slope of the linear relation (see, e.g., Fig. 1 in ref. 17)

$$\mathrm{ln}\,\frac{{f}_{ij}}{{x}_{i}{x}_{j}}=\,\mathrm{ln}\,G-\alpha \,\mathrm{ln}\,{r}_{ij}\mathrm{.}$$
(3)

We will discuss the distance coefficient in the next subsection. For the moment, let us assume that its value is known.
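The slope-based estimate mentioned in connection with Eq. (3) can be sketched as an ordinary least-squares fit in log-log coordinates. This is only an illustration of the standard regression approach, not the procedure used later in this paper; the function name `fit_gravity` and the synthetic test data are assumptions.

```python
import numpy as np

def fit_gravity(f, x, r):
    """Fit ln(f_ij / (x_i x_j)) = ln G - alpha * ln r_ij; return (G, alpha)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    mask = (f > 0) & (r > 0)              # logarithms need positive values
    y = np.log(f[mask] / np.outer(x, x)[mask])
    slope, intercept = np.polyfit(np.log(r[mask]), y, 1)
    return np.exp(intercept), -slope

# Synthetic data generated with G = 2 and alpha = 1.5 (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0])
r = np.array([[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]], dtype=float)
off = r > 0
f = np.zeros_like(r)
f[off] = 2.0 * np.outer(x, x)[off] / r[off] ** 1.5
G_hat, alpha_hat = fit_gravity(f, x, r)
```

On noiseless synthetic data the fit returns the generating parameters; on real flows the estimate is only as good as the assumption that every observed flow obeys Eq. (1).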

In systems such as the international trade network, where the flow between i and j depends only on the importance of the trading countries, the constant G can be obtained simply from Eq. (1),

$${f}_{ij}{r}_{ij}^{\alpha }=G{x}_{i}{x}_{j},$$
(4)

after summing over all pairs of countries, i.e.

$$\sum _{i,j}{f}_{ij}{r}_{ij}^{\alpha }=G\sum _{i,j}{x}_{i}{x}_{j}=G{X}^{2},$$
(5)

where X is the total world GDP, and the left side of Eq. (5) is related to a distance-averaged value of a typical trade channel. This shows that, for a fixed value of α, the parameter G can be calculated directly from real data. Unfortunately, this is not the case for the airline network.
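Equation (5) translates into a one-line estimate of G for a fully connected system. The sketch below follows the equation as written, dividing by X² with X the sum of all x i; the function name and the two-country example values are hypothetical.

```python
import numpy as np

def estimate_G_complete(f, x, r, alpha):
    """G = sum_ij f_ij * r_ij**alpha / X**2, following Eq. (5)."""
    f = np.asarray(f, dtype=float)
    r = np.asarray(r, dtype=float)
    X = float(np.sum(x))                  # total size, e.g. total world GDP
    return float(np.sum(f * r ** alpha)) / X ** 2

# Hypothetical toy data: two countries exchanging 3 units each way.
x = [1.0, 1.0]
r = [[0.0, 2.0], [2.0, 0.0]]
f = [[0.0, 3.0], [3.0, 0.0]]
G_complete = estimate_G_complete(f, x, r, alpha=1.0)
```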

In the air-transport network, besides the main contribution to the flow f ij coming from the ‘direct passengers’ traveling from i to j, the value f ij also contains those travellers for whom the flight ij is only an intermediate link in a longer chain of flights. In other words, the total number of occupied seats, i.e., the sum of all the elements f ij of the matrix F(t),

$$T=\sum _{i,j}{f}_{ij},$$
(6)

is larger than the total number of traveling people. In particular, people traveling from i to j with one change are counted in this sum twice. Correspondingly, those who travel with two changes (i.e., with three connecting flights) are counted three times. Therefore, the global traffic T can be estimated as follows:

$$T\simeq \sum _{l\mathrm{=1}}^{\infty }\,\sum _{(i,j):{d}_{ij}=l}\quad l\cdot {f}_{ij}^{(g)},$$
(7)

where the inner summation runs over all pairs of countries (i,j) such that the shortest path between them, in terms of the number of links, has length d ij  = l, and the expected flow \({f}_{ij}^{(g)}\) is given by the gravity equation (1),

$${f}_{ij}^{(g)}=G\frac{{x}_{i}{x}_{j}}{{r}_{ij}^{\alpha }},$$
(8)

with x i x j standing for the product of GDPs of the connected countries. This means that the constant G can be estimated from the following relation

$$G=T{(\sum _{l\mathrm{=1}}^{\infty }\sum _{(i,j):{d}_{ij}=l}l\cdot \frac{{x}_{i}{x}_{j}}{{r}_{ij}^{\alpha }})}^{-1}\mathrm{.}$$
(9)
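A minimal sketch of Eq. (9) is given below. It assumes that the hop distance d ij is computed by breadth-first search on the directed graph of observed links (f ij > 0) and that unreachable pairs are simply omitted from the sum; the function names and the three-country example are hypothetical.

```python
from collections import deque
import numpy as np

def hop_distances(adj):
    """All-pairs shortest path lengths in number of links (inf if unreachable)."""
    adj = np.asarray(adj, dtype=bool)
    n = len(adj)
    d = np.full((n, n), np.inf)
    for s in range(n):
        d[s, s] = 0
        queue = deque([s])
        while queue:                       # breadth-first search from s
            u = queue.popleft()
            for v in np.flatnonzero(adj[u]):
                if np.isinf(d[s, v]):
                    d[s, v] = d[s, u] + 1
                    queue.append(v)
    return d

def estimate_G_network(f, x, r, alpha):
    """G = T / sum over pairs of d_ij * x_i * x_j / r_ij**alpha, as in Eq. (9)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    d = hop_distances(f > 0)
    pairs = np.isfinite(d) & (d >= 1)      # reachable pairs with i != j
    weighted = d[pairs] * np.outer(x, x)[pairs] / r[pairs] ** alpha
    return f.sum() / weighted.sum()

# Hypothetical chain 0 - 1 - 2: countries 0 and 2 have no direct link.
x = [1.0, 1.0, 1.0]
r = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
f = [[0, 5, 0], [5, 0, 5], [0, 5, 0]]
G_network = estimate_G_network(f, x, r, alpha=1.0)
```

In this toy chain the estimate reproduces the observed link load: with G = 10/3, the direct flow on link 0 → 1 is 10/3 and the transfer flow from 0 to 2 is 5/3, which together give the observed 5.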

Having the constant G estimated, one can plot the observed flows, f ij , versus the expected ones, \({f}_{ij}^{(g)}\). In Fig. 1, we present the data for two different years, 1996 and 2004, and for three different values of the distance parameter, α = 0, 1, and 3. The straight line representing the expected flows \({f}_{ij}^{(g)}\), resulting from Eq. (8), is also drawn for comparison. Let us note that the noise inherent in the raw data makes it difficult to clearly estimate the plotted relation (see Fig. 1b). To overcome this problem, in all the figures, we present logarithmically binned data only.

Figure 1

The observed weights of connections in the airline network, f ij , vs. their expected values, \({f}_{ij}^{(g)}\). Plots in the same row correspond to the same year: 1996 (top row) and 2004 (bottom row). Values of the distance coefficient α are indicated in the plots. All data are logarithmically binned (black squares). In panel (b), we have also shown raw data for comparison (grey squares).
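Logarithmic binning of the kind used in the figures can be sketched as follows. The text does not specify the bin width or the statistic plotted per bin, so the choices made here (a fixed number of bins per decade, the arithmetic mean of the observed flows in each bin, the geometric bin centre) are assumptions, as are the function name and the toy values.

```python
import numpy as np

def log_binned_means(expected, observed, bins_per_decade=4):
    """Average the observed flows in logarithmic bins of the expected flows."""
    expected = np.asarray(expected, dtype=float)
    observed = np.asarray(observed, dtype=float)
    keep = (expected > 0) & (observed > 0)
    e, o = expected[keep], observed[keep]
    lo = np.floor(np.log10(e.min()))
    hi = max(np.ceil(np.log10(e.max())), lo + 1)
    edges = np.logspace(lo, hi, int((hi - lo) * bins_per_decade) + 1)
    idx = np.clip(np.digitize(e, edges) - 1, 0, len(edges) - 2)
    centres, means = [], []
    for b in np.unique(idx):               # skip empty bins
        centres.append(np.sqrt(edges[b] * edges[b + 1]))   # geometric centre
        means.append(o[idx == b].mean())
    return np.array(centres), np.array(means)

# Hypothetical values spanning three decades, one bin per decade.
centres, means = log_binned_means([2, 3, 20, 30, 200], [1, 3, 10, 30, 100],
                                  bins_per_decade=1)
```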

It is clear that the direct applicability of the gravity model to the flight network is at least questionable. The best fit is obtained for α ≈ 1 (panels (b) and (e) in Fig. 1), which coincides with the results obtained by other studies of the distance coefficient in econometric data17. However, even if one accepts this choice of the distance coefficient, the fit is correct only for the right part of each plot. Over a span of at least three orders of magnitude, the expected flows, \({f}_{ij}^{(g)}\), and the observed flows, f ij , differ by up to several orders of magnitude. It seems that important factors other than economic ones increase the passenger flow between some countries. In the next section, we show that the connecting flights from country i to j, which do not depend on the economic conditions, x i x j , of these two countries, can radically change the total flow f ij , and we explain the discrepancies between the gravity model and the real data presented above.

Model of connecting flights

We claim that the passenger flow, f ij , from country i to country j, that is observed in the data, is composed of two components:

  • \({f}_{ij}^{(g)}\) - the number of passengers traveling directly from the origin of a trip in country i to the final destination in country j, which, we assume, is given by Eq. (8),

  • and the number of passengers, \({f}_{ij}^{(transit)}\), who use the connection i → j as a part of their longer journey.

For simplicity, we assume that these longer journeys consist of two direct flights only, i.e., we neglect travels with two or more intermediate stops. This assumption seems to be quite strong. For example, for 2004, we have flight data for 151 countries and 22650 possible connections between them. Only 2308 (10%) of them are direct. There are also 12749 (56%) shortest paths of length 2. This means that we take into consideration only 66% of all possible connections between the countries. However, it is reasonable to expect that passengers traveling with two or more stops account for much less than the missing 34% of the global traffic. One possible reason is that too many transfers reduce the chance of a convenient schedule and cost valuable time. In that case, it is usually better to choose another kind of transportation to reach a destination. We will come back to this issue later when we discuss the obtained results.
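The percentages quoted above follow from simple counting of ordered country pairs, as the short check below shows. Only the three counts are taken from the text; the variable names are ours.

```python
n_countries = 151
possible = n_countries * (n_countries - 1)   # ordered pairs with i != j
direct, two_hop = 2308, 12749                # counts quoted for 2004

share_direct = direct / possible             # about 10% of all pairs
share_two_hop = two_hop / possible           # about 56% of all pairs
share_covered = share_direct + share_two_hop # about 66% covered by the model
```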

The number of passengers \({f}_{ij}^{(transit)}\) can be estimated as follows:

$${f}_{ij}^{(transit)}=\sum _{k:i\nrightarrow k}{f}_{ik}^{(g)}\cdot p(i\to j\to k)\,+\sum _{l:l\nrightarrow j}{f}_{lj}^{(g)}\cdot p(l\to i\to j),$$
(10)

where the first (second) summation is over those nodes k (respectively l) for which there is no direct connection from i to k (from l to j). The term p(i → j → k) describes the probability that one takes a direct flight from i to j during indirect travel from i to k. The contributions of both summations to the total transit passenger flow \({f}_{ij}^{(transit)}\) are depicted graphically in Fig. 2.

Figure 2

Graphical presentation of the summations in Eq. (10).

The choice of a particular connecting flight from i through j to k (which is expressed by the probability p(i → j → k)) should depend, in the first approximation, on the distance r ij between i and j, and the distance r jk between j and k. We thus omit other factors, such as convenient flight schedules, the type or level of airline service, or airport quality, that could influence actual passenger behaviour18. Therefore,

$$p(i\to j\to k)=C\cdot f({r}_{ij},{r}_{jk}),$$
(11)

where C is a normalization constant, which is fixed by the condition

$$\sum _{j}p(i\to j\to k)=\mathrm{1,}$$
(12)

and the function f(r ij , r jk ) should reflect the tendency of the passengers to choose the shortest, and therefore, the cheapest or the fastest connections. Among many possible choices, we have chosen the following form for this function

$$f({r}_{ij},\,{r}_{jk})=\frac{1}{{r}_{ij}{r}_{jk}},$$
(13)

although the other possible forms, e.g.

$$f({r}_{ij},{r}_{jk})=\frac{1}{{r}_{ij}}+\frac{1}{{r}_{jk}},$$
(14)

lead to similar quantitative results (see Supplementary Material for details).

Now, having the model defined, one can estimate the total passenger flow between any two countries as follows:

$${f}_{ij}^{(mcf)}={f}_{ij}^{(g)}+{f}_{ij}^{(transit)},$$
(15)

whose components are correspondingly given by Eqs (8) and (10)–(13).
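The model defined by Eqs (10)–(15) can be sketched compactly by looping over origin-destination pairs without a direct link and distributing their potential flow over the available one-stop routes. The sketch assumes that the normalization in Eq. (12) runs over those intermediate countries that have direct links on both legs, and that the network has no self-loops; the function name and the four-country example are hypothetical.

```python
import numpy as np

def connecting_flights_model(fg, r, direct):
    """Return (f_mcf, f_transit) with p(i->j->k) proportional to 1/(r_ij * r_jk)."""
    fg = np.asarray(fg, dtype=float)
    r = np.asarray(r, dtype=float)
    direct = np.asarray(direct, dtype=bool)   # direct[i, j]: link i -> j exists
    n = len(fg)
    transit = np.zeros((n, n))
    for a in range(n):                        # origin
        for c in range(n):                    # final destination
            if a == c or direct[a, c]:
                continue
            # Assumption: a stop b must offer both legs a -> b and b -> c.
            stops = np.flatnonzero(direct[a, :] & direct[:, c])
            if stops.size == 0:
                continue                      # two or more stops: neglected
            w = 1.0 / (r[a, stops] * r[stops, c])   # Eq. (13)
            p = w / w.sum()                         # normalization, Eq. (12)
            transit[a, stops] += fg[a, c] * p       # first leg a -> b
            transit[stops, c] += fg[a, c] * p       # second leg b -> c
    f_mcf = np.where(direct, fg + transit, 0.0)     # Eq. (15) on existing links
    return f_mcf, transit

# Hypothetical square 0 - 1 - 2 - 3 - 0 without the diagonals 0 - 2 and 1 - 3.
direct = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
r = [[0, 1, 1.8, 2], [1, 0, 1, 2.5], [1.8, 1, 0, 2], [2, 2.5, 2, 0]]
fg = np.ones((4, 4)) - np.eye(4)              # unit potential flow for each pair
f_mcf, f_transit = connecting_flights_model(fg, r, direct)
```

Accumulating the two legs of every indirect journey is equivalent to the two sums in Eq. (10): link i → j collects journeys that start at i and continue beyond j, and journeys that end at j after arriving at i from elsewhere.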

Discussion

In Fig. 3, we compare the results obtained from our model of connecting flights with real data for two different years, 1996 and 2004. We also plot the straight lines corresponding to the classical GM, Eq. (8), to demonstrate the significant improvement of the extended model over the GM alone. The largest discrepancies, visible in the left part of the plots, occur for distant countries with low GDPs, i.e., for large (small) values of the denominator (numerator) on the horizontal axis in Fig. 3. We have checked that these countries are usually island-based (African, Caribbean and Pacific states), and therefore travel between them requires multiple transfers - a feature that is not included in our one-stop model. Moreover, the lack of transport alternatives in these countries makes air travel more preferred than in typical continental states. Although it is possible to extend the model to include two-stop connections, we think it is not worth the price, i.e., the significantly increased complexity of the model, especially since its present form correctly predicts more than 98% of the total passenger flow in the world.

Figure 3

Performance of the model of connecting flights (black squares) against real data (open circles) for two years: 1996 and 2004. Straight lines correspond to the standard gravity model.

The numerical results for \({f}_{ij}^{(mcf)}\) shown in Fig. 3 have been obtained for particular values of the distance coefficient α (the reason why we have chosen α = 1.5 and α = 1.6 for the years 1996 and 2004, respectively, will become clear shortly). One has to keep in mind that other values of this quantity can lead to different results and to better or worse agreement between the model and real data. We can use this observation to select the most probable value of α and to analyse the behaviour of the distance coefficient over time. As mentioned in the introduction, this behaviour is closely related to the progress of globalization in the context of transportation networks. Thus, analysing changes in the distance coefficient provides another indicator of the rate of global integration.

For every year in the analysed period 1990–2011, we have created histograms of the empirical and modelled flows, P(f ij ) and P(\({f}_{ij}^{(mcf)}\))(α), respectively, in m = 15 logarithmically spaced bins. Examples of such normalized histograms for the year 1996 are presented in Fig. 4b. As one can see, the histograms P(\({f}_{ij}^{(mcf)}\))(α) created for different values of the parameter α agree to different degrees with the histogram of empirical flows (marked by the shaded grey area). To measure this agreement, Δ(α), we use a simple RMS formula

$${\rm{\Delta }}(\alpha )=\frac{\sqrt{\sum _{h\mathrm{=1}}^{m}{({P}_{h}({f}_{ij})-{P}_{h}({f}_{ij}^{(mcf)})(\alpha ))}^{2}}}{N}\mathrm{.}$$
(16)

In Fig. 4a, we show how this quality measure, Δ(α), depends on the parameter α in the year 1996. The clearly visible minimum at α = 1.5 indicates the correct value of the distance coefficient in this year.
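The selection of α by minimizing Δ(α) can be sketched as a grid search. The factor N in Eq. (16) is not defined in this section, so the sketch takes it as a parameter that defaults to the number of bins m (an assumption that does not change the position of the minimum). The common bin edges spanning the observed flows, the function names, and the synthetic check are likewise assumptions.

```python
import numpy as np

def normalised_histogram(flows, edges):
    """Fraction of positive flows falling into each bin."""
    flows = np.asarray(flows, dtype=float)
    counts, _ = np.histogram(flows[flows > 0], bins=edges)
    return counts / max(counts.sum(), 1)

def delta(p_obs, p_model, N=None):
    """Agreement measure of Eq. (16); N defaults to the number of bins."""
    p_obs, p_model = np.asarray(p_obs), np.asarray(p_model)
    N = len(p_obs) if N is None else N
    return np.sqrt(np.sum((p_obs - p_model) ** 2)) / N

def best_alpha(observed, model_flows, alphas, m=15):
    """Return the grid value of alpha minimizing Delta, and all Delta values."""
    observed = np.asarray(observed, dtype=float)
    positive = observed[observed > 0]
    # Assumption: both histograms share m log-spaced bins over the observed range.
    edges = np.geomspace(positive.min(), positive.max(), m + 1)
    p_obs = normalised_histogram(observed, edges)
    deltas = [delta(p_obs, normalised_histogram(model_flows(a), edges))
              for a in alphas]
    return alphas[int(np.argmin(deltas))], deltas

# Synthetic check: "observed" flows are generated with alpha = 1.5.
rng = np.random.default_rng(0)
x = rng.lognormal(size=30)
r = rng.uniform(100.0, 10000.0, size=(30, 30))
off = ~np.eye(30, dtype=bool)
model = lambda a: (np.outer(x, x) / r ** a)[off]   # stand-in for the full model
alpha_hat, deltas = best_alpha(model(1.5), model, [1.0, 1.5, 2.0])
```

In an actual application the callable passed as `model_flows` would return the flows \({f}_{ij}^{(mcf)}\) of Eq. (15) for the given α, with G recomputed from Eq. (9) for each value.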

Figure 4

(a) Example of the agreement measure Δ(α) calculated for different values of the parameter α in the year 1996. The arrows show the values for which three histograms P(\({f}_{ij}^{(mcf)}\))(α) are shown in panel (b). The grey shaded area represents the histogram P(f ij ) characterizing real data.

Figure 5 shows the behaviour of the distance coefficient for the years 1990–2011 retrieved by this method. The general conclusion that follows from the figure is that the distance effect in the air transportation network is constant over time and that the globalization process, as reflected in the distance coefficient, has stabilized in the 21st century. This conclusion confirms the other results (presented by the grey circles in Fig. 5) obtained in ref. 17, where the authors estimated the distance coefficient for the international trade network.

Figure 5

The year-by-year values of the distance coefficient α for the air transportation network resulting from the minimization of the measure Δ(α) (black squares) and for the world trade network taken from a previous paper17 (grey circles).

Now, let us briefly analyse the major fluctuations around this constant distance coefficient. In Fig. 5, we have marked three historical events that could have influenced the behaviour of the distance coefficient in the same way as they affected the whole aviation industry. The attacks in New York and Washington D.C. in September 2001 started a chain of events, such as the SARS epidemic, additional terrorist attempts, wars, and rising oil prices, that cost the airline industry three years of growth. Airline revenues and traffic surpassed their 2000 levels only in 2004 (ref. 19). The 2008 global financial crisis cost another several years of growth. The effect was further enhanced by the eruption of the Eyjafjallajökull volcano in Iceland in 2010, which caused the closure of airspace over many countries. The correlation between the distance coefficient and all these events, shown in Fig. 5, confirms that they have a negative impact not only on airline revenues or air traffic but on the whole globalization process.

It should be noted that the globalization process is sometimes conceptualized as a continuous reduction in the effective distance in the world20, which means that the distance coefficient should vanish over time. However, the observed temporary decrease in the distance coefficient is evidently negatively correlated with the progress of globalization. This confirms recent observations that the distance coefficient is instead associated with the fractal dimension of the considered system, and that a decrease in this coefficient is the effect of a decreasing number and weight of air transport connections, which reduces the dimensionality of the system17.

Concluding remarks

The presented model of connecting flights allowed us to retrieve, from the observed flow between any two countries, the terms corresponding to the direct and transfer passengers using this connection. Although we neglected many aspects that influence travellers' choice of intermediate airports, the model correctly predicts more than 98% of the total passenger flow in the world. The only assumption we had to make was that the gravity model is applicable to the air transport network. The correctness of this assumption was confirmed by the time behaviour of the retrieved distance coefficient, which reflects several historical events with known strong economic impact.

There are still many research directions that may be worth exploring in this area. First, the most promising of these seems to be the derivation of the so-called fluctuation-response relations21, which would allow changes in the flows f ij to be predicted on the basis of changes in the GDPs of the connected countries. Now that we can determine the direct and indirect contributions to a particular flow, this should be possible by analogy with the similar approach applied to the international trade network12. Next, it would be challenging but also rewarding to extend the model by taking into account, e.g., time schedules, which strongly determine the passenger preference for a particular intermediate airport. This would generally allow modelling of the microscopic time-dependent flows in the network. Analysing the air transportation network at a more detailed level, in which the nodes represent single cities or even airports22 rather than whole countries, can also be interesting for strategic planning in the airport industry.

Methods

Results reported in this paper are based on data provided by the International Civil Aviation Organization (ICAO). They contain “annual traffic on-board aircraft on individual flight stages of international scheduled services”23. By a flight stage, or a direct flight, we understand “the operation of an aircraft from take-off to landing”24. This means that if a particular flight consists of two (or more) flight stages, we consider it as two (or more) separate direct flights.

Among the many attributes the data contain, such as the aircraft type used, the number of flights operated, the aircraft capacity offered, and the traffic (passengers, freight and mail) carried, in our analyses we use only the number of passengers traveling between countries. The data are employed to build a sequence of weighted directed networks, F(t), in the consecutive years t = 1990, …, 2011. In each network, each country is represented by a node and the weight of a link f ij (t) refers to the number of passengers traveling from i to j in year t. The flows f ij (t) may vary from a few persons (e.g., 6 people travelled from Togo to Uruguay in 2004) to several millions of passengers (e.g., 9532303 people travelled from Great Britain to USA in 2000).
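The construction of the yearly networks F(t) can be sketched as an aggregation of passenger records by year and by ordered country pair. The record layout (year, origin, destination, passengers), the function name, and the summation of repeated records are assumptions; the two named records use the flows quoted above, while the records for countries "A" and "B" are hypothetical.

```python
from collections import defaultdict

def build_networks(records):
    """Aggregate (year, origin, destination, passengers) into {year: {(i, j): f_ij}}."""
    networks = defaultdict(lambda: defaultdict(int))
    for year, origin, destination, passengers in records:
        # Assumption: several records for the same directed pair are summed.
        networks[year][(origin, destination)] += passengers
    return {year: dict(links) for year, links in networks.items()}

records = [
    (2004, "Togo", "Uruguay", 6),
    (2000, "Great Britain", "USA", 9532303),
    (2004, "A", "B", 10),    # hypothetical record
    (2004, "A", "B", 5),     # hypothetical record for the same directed pair
]
networks = build_networks(records)
```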

Apart from traffic data, we also use econometric data from Penn World Table 8.1 (ref. 25). To characterize the economic performance of a country, we use the real GDP at constant 2005 national prices, x i (t) (in millions of 2005 US$). The distance between countries is based on CEPII data26. Geodesic distances therein are calculated following the great circle formula, which uses the latitudes and longitudes of the most important cities/agglomerations (in terms of population).
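For orientation, a great-circle distance between two points given by latitude and longitude can be computed with the haversine formula, as sketched below. This is a generic illustration and not necessarily the exact implementation behind the CEPII data; the function name and the mean Earth radius of 6371 km are assumptions.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula for the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# A quarter of the equator: the distance should be pi * R / 2.
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
```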