Introduction

Measuring monocentricity

People in cities interact with their environment by developing urban land for different socioeconomic activities. The way in which land use is located and arranged within a city results either from self-organising mechanisms over the course of time or from specific interventions through different varieties of urban planning, at different spatial scales. In this context, the configuration is usually referred to as urban structure and, through the study of such structures, we can learn more about the spatial behaviours of the societies that, over time, have built them. Moreover, urban structure also plays an important role in shaping the present and future, given its impact on different socioeconomic features such as mobility, access to jobs, social mixing, heterogeneity, segregation, deprivation, urban efficiency, and sustainability.

The simplest form of urban structure corresponds to the monocentric city, where socioeconomic activity is localised in a unique central region. In practice, monocentric cities facilitate the accumulation of social interactions and innovation, and consequently give rise to economies of agglomeration characterised by increasing returns to scale1,2. However, monocentric cities are also subject to heavy tidal flows on the transport facilities during peak hours, severe congestion and disproportionally high rents close to the centre3,4,5. The monocentric structure of cities prevailed until the industrial revolutions led to new forms of transport that broke the bounds of compact cities. Consequently, monocentric cities have gradually decentralized, transforming into complex hierarchies of different kinds of centres, neighbourhoods and sprawling structures that are tied together by a multiplicity of transport and information systems6. Yet, explanatory models of urban structure based on a monocentric approach are still used due to their simplicity and formal analytical elegance. Their validity, however, should be questioned both in terms of the theoretical assumptions needed for the formulation of the models7,8,9 and from the point of view of public policy10, since most plans for future cities have long abandoned the idea of the monocentre.

Polycentricity has therefore become the focus of much spatial policy11, since it is believed that urban dwellers in polycentric cities might benefit from congestion relief in comparison with their monocentric counterparts12 and from increased accessibility to jobs and services, which may translate into higher rent and housing prices all across the city, but also into more time-efficient and cost-efficient travel. Despite the rise in popularity of the idea of polycentric development, it remains a rather fuzzy concept as it seems to mean different things to different actors and on different scales11,13,14,15,16. The lack of a concise and coherent definition raises an issue: how to measure polycentricity? If we do not know what to measure, we simply cannot measure it11,15,16. In this work, instead of attempting to answer the ill-defined question ‘to what extent is a city polycentric?’, we provide an approach to analysing departures from a well-defined concept of monocentricity.

Despite the fuzziness in the definition of ‘polycentric city’, there is a long tradition of theoretical research and empirical evidence surrounding the debate on monocentricity versus polycentricity. We will simply indicate recent work here such as that based on an analysis of data from US metropolitan areas by Arribas-Bel and Sanz-Garcia10, which shows that monocentricity still retains a substantial influence on the intraurban structure of many metropolitan areas. This is despite the general consensus in the literature that modern cities above a certain size threshold become polycentric and that monocentricity is an older concept more appropriate to the city in history prior to the industrial revolution. In this sense, the concept might perhaps be somewhat obsolete when dealing with the real world. Additionally, the authors cited in10 find that there is no clear evolutionary trend in US cities towards polycentricity between 1990 and 2010.

By contrast, Alidadi and Dadashpoor17 analyse data from Iran to find that a monocentric model is not able to explain the spatial distribution of employment in Tehran while the main core has been losing its importance with the passage of time. Li18 draws upon fine-grained LandScan population data corresponding to 286 Chinese cities to find that in general, urban spatial structure has become more polycentric as well as more concentrated (i.e. with a higher share of their population living in the centres) while these changes have usually resulted in population and economic productivity growth.

There are other studies that find evidence for mixed types of urban structure. Hajrasouliha and Hamidi19 base their study on three typologies of urban structure: monocentricity, polycentricity, and generalised dispersion. When analysing the spatial structure of employment data from 356 US metropolitan regions, they find that mixed typologies of urban structure outnumber the three “pure” ones by almost four to one. They also find that polycentricity is somewhat more common than monocentricity. Similarly, in Ref.20, Sweet et al use cross-sectional data to estimate the relative strengths of monocentricity, polycentricity, and dispersion for characterising Canadian cities. Their results indicate that elements of each model are evident, but each tends to dominate in different contexts. When focusing on Montreal, Toronto, and Vancouver, their results imply that accessibility, municipal competition, and globalisation play a role in shaping urban structure.

Human mobility and urban structure

In the past, most empirical studies of urban structure based their conclusions on data associated with the spatial distribution of employment or population, obtained largely from traditional sources of direct observation by questionnaires such as surveys, censuses and administrative records. The rationale behind this choice of datasets is that they are comprehensive and representative and have the potential to uncover where city dwellers conduct most of their socioeconomic activity. It has only been in recent years that the focus has been turned to alternative data sources, which can offer real-time and easy-access records at very small spatial scales. In particular, the locations that people choose to visit at different times of the day or week are very much conditioned by the spatial structure of the city and at the same time, the complexity of human movements shapes the usage of urban space and the arrangement of resources6. Therefore, the study of patterns in human mobility through alternative data sources can help us understand the travel behaviour of city dwellers and it can also help us uncover the socio-economic features of urban structure.

For example, recent studies have used data derived from social media platforms as well as location tracker devices in mobile phones in order to understand how populations distribute across the urban landscape based on the places visited by the users, both at the intra-city and inter-city scales21,22,23,24,25,26,27,28,29,30,31,32,33. Taxi trajectory data is another alternative source of data that has gained in popularity in recent years as a means to uncover information about urban structure21,34,35,36,37,38. Taxi trajectory data not only has the potential to reveal the characteristics of human movement within the city but also real-time traffic status as well as potential social inequalities. A third alternative source of data is that derived from smart travel cards or simply, smart cards. Like the other sources of alternative data already mentioned, smart card data offers information regarding daily human activities at high resolution, both in the spatial and temporal domains and consequently, it has been used to explore urban structure39,40,41,42,43,44. Here, we will focus on the latter type of new data sources that record such movements. In our case, this is smart card data from the automatic fare collection system in London’s and Seoul’s public transport, which contains information about the origin, destination and time at which each individual journey occurs.

Aim and contribution

Our aim is to provide a novel approach to model the extent to which a city departs from the monocentric structure by considering the variability inherent in human mobility patterns, and by avoiding the fuzziness in the concept of polycentricity. To investigate the applications of this approach, we consider two case studies using high spatiotemporal resolution data derived from smart travel cards corresponding to London, United Kingdom, and Seoul, South Korea.

Our methodology first considers the frequency distribution of the length of journeys terminating at each station in the public transport system of a given city on a typical weekday. We define the “nucleus” of each city as the station representing a hypothetical centre. We also consider the network structure of the public transport system in order to measure the length of the journeys by the network distance between stations. We then introduce Poisson mixture models as a statistical approach to describe the frequency distribution of the length of the journeys terminating at each station in the transport system. The Poisson mixture models enable us to capture the variability in the human mobility patterns reflected in urban structure, which in real cities includes a blend of features from both monocentric and polycentric cities.

Next, we state what we call the monocentric hypothesis: “If a city was perfectly monocentric, the expected length of the journeys taken to a given station, except for the nucleus itself, would be equal to the length of the shortest path between the nucleus and the destination station”. In this hypothetical scenario, the nucleus would be the only centre for socioeconomic activity in the city, and consequently, a typical journey terminating at a given station other than the nucleus would have its origin at the nucleus. Journeys whose destination is the nucleus would have their origin at various locations across the public transport system.

In reality, cities and urban mobility patterns are more complex than stated by the monocentric hypothesis, so quantifying deviations from this idealistic behaviour enables us to understand the extent to which a city departs from monocentricity, or in other words, it enables us to indirectly infer its degree of polycentricity.

Therefore, the main contribution of this analysis is a solution to the problem of quantitatively describing the degree of monocentricity of a city. Our data-driven approach based on mixture models considers the complexity of urban space since these models are able to capture the variability in human movements that arises as a result of sophisticated forms of urban structure. Instead of considering discrete typologies of urban structure (e.g. monocentric and polycentric), the method proposed here conceptualises urban structure typologies as a spectrum, where monocentricity is an idealistic extreme. While the contribution of this paper is mostly methodological, we use London and Seoul as case studies to illustrate how our method can be applied. According to the observed patterns of human mobility, we specifically find evidence that London displays a higher degree of monocentricity than Seoul, suggesting that Seoul is likely to be more polycentric than London.

The rest of the paper is organized as follows. In “Methodology”, we describe in detail the data sets used for the analysis and how the data has been processed. We also explain how we conceive of the public transport system as a complex network, and we introduce the probabilistic modelling framework and mixture models. In “Experiments and results”, we present the results of the analysis corresponding to the two case studies of London and Seoul. We provide some concluding remarks and points of discussion in “Discussion and conclusions”. We also include “Supplementary Information” with some additional results to support our findings and conclusions.

Methodology

Data and notation

The Oyster card in London and the T-money card in Seoul are automatic fare collection systems that record the place and time when a traveller enters and exits the public transport system by tapping in and out with their card. In 2012, more than 80% of all journeys on public transport in London were made using Oyster card whereas in Seoul 98.9% of all journeys on public transport were made using T-money in 2013. For London, we use Oyster-card data recorded during 5 full weekdays (24 h) between January 20 and 24 in the year 2014, and for Seoul, we use T-money card data recorded during 4 full weekdays between December 17 and 21 in the year 2012. We exclude the data corresponding to Wednesday, December 19, because it was a presidential election day in Korea and regular travel patterns were disrupted.

While we hold data sets containing tap-in and tap-out records of Oyster and T-money cards, considering one or the other type of record will not affect the results of the analysis. This claim is based on the assumption that, on a daily basis, a passenger who takes a journey from one station to another, with possible stops on the way, will typically “undo” the journey by going back to the original station where they departed from at some point during the day. Even though our assumption might not always be true, this behaviour is frequently displayed by the passengers who commute daily to work, school or other regular activities, and who represent the majority of users of the public transport system on weekdays. Hence, in this analysis, we use tap-out records but we claim that analogous results would be obtained using tap-in records instead.

Based on the tap-out records, we obtain the daily count of journeys of a given length which terminate at each station on a typical weekday. For a given journey length and a given station, the daily count on a typical weekday is computed by averaging the daily count over all the weekdays included in the raw data set and rounding the average value to the closest integer. The sum of the average daily count of journeys for all the stations was 3.22 million encompassing 382 stations in London and 5.96 million for 512 stations in Seoul.
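The aggregation described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline used for the analysis: the record layout `(day, station, length)` and the function name `typical_weekday_counts` are assumptions introduced here, and the tie-breaking rule for rounding is not specified in the text.

```python
from collections import Counter

def typical_weekday_counts(records, n_days):
    """Average daily count of journeys per (station, length), rounded to an integer.

    records: iterable of (day, station, length) tuples, one per journey
             (hypothetical layout, assumed for illustration).
    n_days:  number of weekdays covered by the raw data.
    """
    # Total number of journeys of each length terminating at each station.
    totals = Counter((station, length) for _day, station, length in records)
    # Python's round() resolves exact ties to the nearest even integer;
    # this tie rule is an assumption, not something stated in the text.
    return {key: round(total / n_days) for key, total in totals.items()}
```

For London, `n_days` would be 5 and for Seoul 4, following the data description above.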

For the subsequent parts of the analysis, we introduce the following notation. Each of the N stations in each city is symbolised by \(S_i\), with \(i=1,...,N\). Station \(S_i\) is the destination of \(M_i\) journeys so it has \(M_i\) tap-outs. The length of the lth journey terminating at station \(S_i\) is symbolised by \(L_i^l\), where l is an index over the \(M_i\) journeys to \(S_i\) and therefore, \(l=1,...,M_i\).

The transport system as a complex network

We define an undirected simple network \(G = (V,E)\) as an abstract conceptualisation of the public transport system. The network G is formed by the set of N vertices or nodes V and the set of edges E. The ith node of the network G corresponds to station \(S_i\) in the public transport system. An edge is present between two nodes i and j if there is at least a line of transport that provides a direct connection between the stations \(S_i\) and \(S_j\). The distance between stations \(S_i\) and \(S_j\) is symbolised by \(d_{ij}\) and is defined as the minimum number of edges that need to be traversed in order to travel from \(S_i\) to \(S_j\). The length of a journey between two stations is defined as the number of edges that are traversed from the origin to the destination nodes, but here we assume that the length of a journey between stations \(S_i\) and \(S_j\) is equal to the distance \(d_{ij}\), i.e. we assume that, from all possible trajectories from \(S_i\) to \(S_j\), passengers always choose the one involving the fewest stops. Here, we use London and Seoul as illustrative case studies, but we should point out before proceeding to the next steps that the methodology that we propose here is generalisable to other cities.
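Because G is unweighted, the distances \(d_{ij}\) from any given station can be obtained with a breadth-first search. The following Python sketch illustrates this under the assumption that the network is stored as an adjacency dictionary; the function name and the toy station labels are hypothetical.

```python
from collections import deque

def network_distances(adjacency, source):
    """Minimum number of edges from `source` to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                # First visit is along a shortest path in an unweighted graph.
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

# Toy undirected network with hypothetical station labels.
toy_network = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
distances_from_a = network_distances(toy_network, "A")
```

Running the search once from the nucleus gives the distances \(d_{1i}\) for all stations; running it from every station gives the full set of \(d_{ij}\).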

For the next steps of the analysis, we establish a hypothetical centre in the network, which we call the “nucleus”. Different notions of centrality can be considered, although not all of them are suitable for our analysis. For example, if we considered that the nucleus is the closest station to the geographical centroid of the city, then the definition of centre would depend on the physical boundaries of the city region, which in turn, can be established according to a variety of different criteria. If instead, we considered a measure of centrality based on the topology of the network, such as the betweenness centrality of each node, then the nucleus in Seoul would be Wangsimni station, which does not necessarily represent what many Seoulites would consider to be a central region of Seoul. Similarly, a measure of centrality based on traffic flows may also not coincide with what most people consider to be the centre. For these reasons, we opt for a somewhat arbitrary choice of nucleus based on what is popularly considered to be a central area: Piccadilly Circus station in London and City Hall station in Seoul. We assign the index \(i=1\) to the station corresponding to the nucleus in each city. To counter the arbitrariness of these choices of nucleus, we provide a sensitivity test of the results of our analysis. This can be found in the “Supplementary Information”, in the section titled “Sensitivity analysis for different choices of nucleus”.

Figure 1 shows the physical layout of London and Seoul’s public transport system, with the lines and stations that are included in the data sets. The transport lines are traced simply as straight lines to show the topology of the network. The colour of the nodes corresponds to the average length of the journeys terminating at \(S_i\), symbolised by \(\bar{L}_i\) and the size represents the number of journeys \(M_i\) reaching each station \(S_i\).

Figure 1

Map of London’s and Seoul’s public transport systems. The figure has been produced using Python and Inkscape, based on the data sets described in “Data and notation” and geographic data from OpenStreetMap for the basemap layers.

From Fig. 1, it is evident that the cities considered here have different spatial extents and this could be affecting the observed degree of polycentricity. One way to make the analysis more comparable between cities could be to normalise the journey length measure between stations so that both cities are on the same spatial scale. In this analysis, network distance is used to measure journey length instead of physical distance, but network distance does not always reflect the spatial extent of the cities. For instance, even if the spatial distance between an origin and a destination station is large, the network distance would remain relatively small if there are not many stations in between. Therefore, normalising the measure of network distance would not necessarily improve the comparability of the analysis for different cities. Furthermore, while the maximum spatial or network distance may be an indicator of a city’s polycentricity, it is not the only one. For example, other factors such as public transportation accessibility, land use patterns, economic development and even culture may also affect the degree of polycentricity in a city. So even if both cities were compared on the same distance scale, there would always be other factors that remain unnormalised. For these reasons, we argue that it is appropriate to keep the measure of journey length unnormalised.

Modelling the distribution of journey lengths

In this section, we introduce a probabilistic approach to model the frequency distribution of journey lengths on a typical weekday. We regard \(L_i\) as a discrete random variable denoting the length of journeys whose destination is station \(S_i\). For each station, our data set gives \(M_i\) realisations of \(L_i\), so the observed length of the lth journey, symbolised by \(L_i^l\), corresponds to the lth realisation of \(L_i\). The true probability distribution of random variable \(L_i\) is unknown; however, its empirical probability density function, denoted by \(\hat{f}_i\), can be obtained from the observed data as

$$\begin{aligned} \hat{f}_i(L_i = h) = \frac{1}{M_i}\sum _{l=1}^{M_i}\mathbbm {1}_{L_i^l=h}, \end{aligned}$$
(1)

with \(h\in \mathbb {N}\). In Eq. (1), \(\mathbbm {1}_{L_i^l=h}\) is an indicator function that takes the value 1 when \(L_i^l = h\) and zero otherwise. Hence, the probability that the length of a journey with destination at station \(S_i\) is equal to h is approximated by \(\hat{f}_i(L_i = h)\), computed as the number of observed counts of journeys of length h terminating at \(S_i\), divided by the total number of journeys terminating at \(S_i\), i.e. \(M_i\).
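Eq. (1) amounts to a normalised histogram of the observed journey lengths. A minimal Python sketch, assuming the journey lengths for one station are available as a list of integers (the function name is hypothetical):

```python
from collections import Counter

def empirical_pmf(lengths):
    """Empirical distribution of Eq. (1): share of journeys having each length h."""
    m = len(lengths)            # M_i, the number of journeys to the station
    counts = Counter(lengths)   # number of journeys of each observed length h
    return {h: c / m for h, c in sorted(counts.items())}

# Four hypothetical journeys terminating at one station.
pmf = empirical_pmf([2, 2, 3, 5])
```

Lengths that never occur at a station are simply absent from the dictionary, which corresponds to \(\hat{f}_i(L_i = h) = 0\).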

Under the monocentric hypothesis stated in “Aim and contribution”, if a city was perfectly monocentric, the expected length of the journeys taken to a given station, except for the nucleus itself, would be equal to the length of the shortest path between the nucleus and the destination station. Hence, this null hypothesis can be expressed mathematically as Eq. (2)

$$\begin{aligned} E[L_i] = d_{1i} \end{aligned}$$
(2)

for \(i=2,...,N\), where \(E[L_i]\) is the expected value of random variable \(L_i\), which can be approximated by the sample mean \(\hat{\mu }_i = \frac{1}{M_i}\sum _{l=1}^{M_i}L_i^l\). Taking this into account, the monocentric hypothesis can be expressed as \(\hat{\mu }_i = d_{1i}\).

In reality, the data does not lie on the line given by \(\hat{\mu }_i = d_{1i}\), as shown in Fig. 2. In the Figure, the network distance from the nucleus to the destination station is represented on the x-axis and the average length of journeys arriving at a station is represented on the y-axis. Each bubble in the plot represents one station. The solid line is the regression line obtained via ordinary least squares. The results of the linear regression are shown in Table 1. The red dotted line represents Eq. (2), i.e. the line where points would lie if the monocentric hypothesis was satisfied. In fact, there is a tendency for the average length of the journeys terminating at station \(S_i\) to be less than \(d_{1i}\) as \(d_{1i}\) gets larger. This suggests that journeys which terminate at stations that are far from the nucleus, tend to take place more locally. The effect is particularly obvious in Seoul, showing that the observed patterns of mobility depart from the monocentric hypothesis to a greater extent.
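The comparison between the fitted regression line and Eq. (2) can be illustrated with a short Python sketch. It assumes an unweighted ordinary least squares fit of the mean journey length on \(d_{1i}\); whether the regression in Table 1 uses any weighting is not stated here, and the function name and toy numbers are hypothetical.

```python
def ols_line(x, y):
    """Slope and intercept of the ordinary least squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical stations at network distances 1..4 from the nucleus.
d_1i = [1, 2, 3, 4]
# Mean journey lengths that grow more slowly than d_1i (toy values).
mu_hat = [2.0, 2.5, 3.0, 3.5]
slope, intercept = ols_line(d_1i, mu_hat)
```

Under the monocentric hypothesis the fitted line would have slope 1 and intercept 0; a slope below 1, as in this toy example, corresponds to journeys that are shorter than \(d_{1i}\) for stations far from the nucleus.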

Figure 2

Relationship between the mean of the distribution of journeys terminating at each station, \(\hat{\mu }_i\), and the distance \(d_{1i}\) between the nucleus \(S_1\) and the destination station \(S_i\).

Table 1 Results of linear regressions considering \(d_{1i}\) as the explanatory variable and \(\hat{\mu }_i\) as the response variable. Piccadilly Circus is considered to be the nucleus in London and City Hall in Seoul.

In addition, we observe not only that \(\hat{\mu }_i = d_{1i}\) is not satisfied, but also that \(L_i\) displays a high degree of variability for \(i=1,...,N\). This effect is captured in Fig. 3, where each data point corresponds to an individual journey, the x-coordinate represents the distance \(d_{1i}\) between the nucleus and the destination station, and the y-coordinate represents the length of each individual journey \(L_i^l\), with \(l=1,...,M_i\) and \(i=1,...,N\). Once again, the red dotted line represents Eq. (2).

Figure 3

Distribution of the length of journeys terminating at any destination station which is at a given distance from the nucleus. The solid line indicates the median of the distribution for each value of \(d_{1i}\). The dashed line represents the line \(L_i^l = d_{1i}\).

Figures 2 and 3 are thus a manifestation that real cities do not conform to the hypothesised monocentric scenario. Next, we explore the deviations from the monocentric hypothesis by leveraging the observed variability in our data.

An approach based on mixture models

In order to describe the deviations from the hypothesised monocentric behaviour, we introduce mixture models, which are probabilistic models with the ability to represent the possible presence of different statistical sub-populations within the overall population. In the context of this paper, mixture models can be used to infer the possible presence of centres other than the nucleus based only on the data for the number and length of journeys terminating at each station. The approach that we propose here consists in assuming that the true probability distribution for the length of journeys terminating at station \(S_i\) is given by a mixture distribution of the following form

$$\begin{aligned} f_i(L_i = h|\textbf{w}_i, \varvec{\theta }_i) = \sum _{j=1}^Kw_i^jp_i^j(L_i = h|\varvec{\theta }_i^j). \end{aligned}$$
(3)

In Eq. (3), the probability that a journey terminating at station \(S_i\) has length \(L_i = h\), is now conditional on the parameters of the true distribution, \(\textbf{w}_i\) and \(\varvec{\theta }_i\). The probability density function of the true distribution is given by a weighted sum of K probability density functions corresponding to each of the components of the mixture. The number of components in the mixture K corresponds to the number of centres assumed by the model. If \(K=1\), then the only centre accounted for in the model is the nucleus, but if \(K>1\), the model assumes that there are subcentres other than the nucleus. The weights of these components are given by the column vector \(\textbf{w}_i\), with K entries that satisfy \(\sum _{j=1}^Kw^j_i = 1\). Each component has an associated probability density function given by \(p^j_i(L_i=h|\varvec{\theta }_i^j)\), with parameters \(\varvec{\theta }_i^j\) and so, \(\varvec{\theta }_i\) is a matrix with K rows and as many columns as the number of parameters that characterise the probability density function \(p^j_i\).

For the purposes of our analysis, we set \(p^j_i\) to be a Poisson distribution with parameter \(\mu _i^j\), so that \(p^j_i(L_i=h|\varvec{\theta }_i^j)\) is now \(p^j_i(L_i=h|\mu _i^j) = \frac{1}{h!}(\mu _i^j)^h\exp (-\mu _i^j)\), for \(i=1,...,N\) and \(j=1,...,K\). This choice of distribution is motivated by the fact that the length of journeys terminating at station \(S_i\) represented by random variable \(L_i\) is in the form of count data, but also by the mathematical simplicity of the Poisson distribution, which is characterised by only one parameter.

In order to find the maximum likelihood estimates of \(\textbf{w}_i\) and \(\varvec{\theta }_i\) given the observed data for station \(S_i\), we apply the expectation-maximisation algorithm. The number of components K for the Poisson mixtures is one of the hyperparameters of the algorithm and needs to be determined before the learning process. In “Experiments and results”, we discuss different choices for the number of components K.
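To make the estimation step concrete, the following is a minimal Python sketch of the expectation-maximisation algorithm for a K-component Poisson mixture fitted to the journey lengths of a single station. It is an illustration under stated assumptions rather than the implementation used for the analysis: the function names, the quantile-based initialisation, the stopping rule and the numerical floors are all choices made here for the example.

```python
import math

def poisson_log_pmf(h, mu):
    """Logarithm of the Poisson probability of observing h with mean mu."""
    return h * math.log(mu) - mu - math.lgamma(h + 1)

def fit_poisson_mixture(lengths, k, n_iter=500, tol=1e-10):
    """Fit weights and means of a k-component Poisson mixture by EM.

    Returns (weights, means, log_likelihood), where the log-likelihood is
    the value computed in the last E-step.
    """
    data = sorted(lengths)
    n = len(data)
    # Initialisation (an assumption): equal weights, means at spread-out quantiles.
    weights = [1.0 / k] * k
    means = [max(float(data[int((j + 0.5) * n / k)]), 0.5) for j in range(k)]
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each journey.
        resp, ll = [], 0.0
        for h in data:
            logs = [math.log(weights[j]) + poisson_log_pmf(h, means[j])
                    for j in range(k)]
            top = max(logs)
            norm = top + math.log(sum(math.exp(v - top) for v in logs))
            ll += norm
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: update weights and means from the responsibilities.
        for j in range(k):
            total = max(sum(r[j] for r in resp), 1e-300)  # floor avoids division by zero
            weights[j] = total / n
            means[j] = max(sum(r[j] * h for r, h in zip(resp, data)) / total, 1e-12)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, ll
```

As with any EM procedure, the result is a local maximum of the likelihood and may depend on the initialisation, so in practice several starting points could be compared.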

Experiments and results

The case where \(K=1\) is equivalent to a simple Poisson distribution. The maximum likelihood estimator for the Poisson parameter \(\mu _i\) corresponding to station \(S_i\) is the average of the \(M_i\) observations for \(L_i\). Therefore, a visualisation of the relation between the estimated Poisson parameter corresponding to a station and the distance between the nucleus and the station is provided in Fig. 2.
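For completeness, the statement about the maximum likelihood estimator can be verified directly. Assuming that the \(M_i\) observed journey lengths are independent realisations of a Poisson variable with parameter \(\mu _i\), the log-likelihood and its stationary point are

$$\begin{aligned} \ell (\mu _i) = \sum _{l=1}^{M_i}\left[ L_i^l\log \mu _i - \mu _i - \log (L_i^l!)\right] , \qquad \frac{d\ell }{d\mu _i} = \frac{1}{\mu _i}\sum _{l=1}^{M_i}L_i^l - M_i = 0 \;\Rightarrow \; \hat{\mu }_i = \frac{1}{M_i}\sum _{l=1}^{M_i}L_i^l. \end{aligned}$$

The second derivative, \(-\mu _i^{-2}\sum _{l}L_i^l\), is negative whenever at least one journey has non-zero length, so the stationary point is a maximum.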

Poisson mixture model with two components

To account for the presence of centres other than the nucleus, we introduce Poisson mixture models with \(K=2\). The parameters of a Poisson mixture with two components are the weights \(\textbf{w}_i = (w^1_i, w^2_i)\) and the distribution parameters for each component \(\varvec{\theta }_i = (\mu ^1_i, \mu ^2_i)\), which are also the mean of each component. We can obtain the maximum likelihood estimators for these parameters by applying the expectation-maximisation algorithm to data corresponding to \(L_i\) and we denote them by \(\hat{w}^1_i\), \(\hat{w}^2_i\), \(\hat{\mu }^1_i\) and \(\hat{\mu }^2_i\). We refer to the component with the lower estimated mean as the proximal component. We denote its associated weight by \(\hat{w}^p_i\) and its mean by \(\hat{\mu }^p_i\), so that \(\hat{\mu }^p_i = \min ( \hat{\mu }^1_i, \hat{\mu }^2_i)\). Similarly, we call the component whose estimated mean is higher the distal component and denote its associated weight by \(\hat{w}^d_i\) and its mean by \(\hat{\mu }^d_i\), so that \(\hat{\mu }^d_i = \max ( \hat{\mu }^1_i, \hat{\mu }^2_i)\). The 2-component Poisson mixture model enables us to classify each of the \(M_i\) individual observations of \(L_i\) as belonging to the proximal or distal components with probability given by the respective estimated weights. Therefore, journeys belonging to the proximal component are those that take place at a local scale and the journeys belonging to the distal component take place at a global, city-wide scale.
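The labelling of the two components can be sketched in Python as follows. The function names and the numerical values are hypothetical. The second function goes one step beyond the text: it computes the posterior probability, via Bayes' rule, that a journey of a given length belongs to the distal component, which is one possible way of assigning individual journeys to components and is included here only as an assumption-laden illustration.

```python
import math

def label_components(weights, means):
    """Label the two fitted components as proximal (lower mean) and distal (higher mean)."""
    p, d = sorted(range(2), key=lambda j: means[j])
    return {"proximal": (weights[p], means[p]), "distal": (weights[d], means[d])}

def posterior_distal(h, labelled):
    """Posterior probability that a journey of length h comes from the distal component."""
    def weighted_pmf(weight, mu):
        # Weight times the Poisson probability of h, computed via logs for stability.
        return weight * math.exp(h * math.log(mu) - mu - math.lgamma(h + 1))
    prox = weighted_pmf(*labelled["proximal"])
    dist = weighted_pmf(*labelled["distal"])
    return dist / (prox + dist)

# Hypothetical estimates for one station: weights 0.3 and 0.7, means 12 and 4.
labelled = label_components([0.3, 0.7], [12.0, 4.0])
```

With these toy values, a journey of length 4 is attributed almost entirely to the proximal component, whereas a journey of length 15 is attributed almost entirely to the distal one.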

Figure 4 shows the relationship between the proximal and distal means corresponding to a given station \(S_i\) and the network distance \(d_{1i}\) between \(S_i\) and the nucleus \(S_1\). Both the proximal and distal means, \(\hat{\mu }^p_i\) and \(\hat{\mu }^d_i\), are represented in the y-coordinate. The size of the data points is proportional to the number of journeys terminating at each station. The results of the regression are displayed in Table 2. In the “Supplementary Information”, we discuss the relationship between \(d_{1i}\) and the weights corresponding to station \(S_i\), \(\hat{w}^d_i\), \(\hat{w}^p_i\).

Figure 4

Relationship between the proximal and distal mean of the distribution of journeys terminating at each station, \(\hat{\mu }^p_i\) and \(\hat{\mu }^d_i\), and the distance \(d_{1i}\) between the nucleus \(S_1\) and the destination station \(S_i\).

Table 2 Results of linear regressions considering \(d_{1i}\) as the explanatory variable and \(\hat{\mu }_i^p\), \(\hat{\mu }_i^d\) from the 2-component Poisson mixture model as the response variables. Piccadilly Circus is considered to be the nucleus in London and City Hall in Seoul.

As \(d_{1i}\) becomes larger, there is no significant increase in the proximal mean \(\hat{\mu }^p_i\), since it remains around 5 and never above 10. The effect is strikingly obvious in the case of Seoul. These observations are likely to be the consequence of the existence of other socioeconomic centres, closer to the destination station \(S_i\), where passengers prefer to travel to carry out some socioeconomic activities at a more local level. In contrast, the distal component displays a significant linear growth with \(d_{1i}\). The distal component captures long-distance, city-wide journeys from stations that are possibly close to the nucleus, to stations that are in the peripheral regions of the city.

Poisson mixture model with three components

Similarly, with \(K=3\), the parameters to be estimated are \(\textbf{w}_i = (w^1_i, w^2_i, w^3_i)\) and \(\varvec{\theta }_i = (\mu ^1_i, \mu ^2_i, \mu ^3_i)\). The proximal and distal components are defined as for \(K=2\). We call the remaining third component the medial component, and we denote its weight by \(w^m_i\) and its distribution parameter by \(\mu ^m_i\). The results of the linear regression between the mean of each component and \(d_{1i}\) are gathered in Table 3. Figure 5 represents the relationship between the proximal, medial and distal means corresponding to each station \(S_i\), i.e. \(\mu ^p_i\), \(\mu ^m_i\) and \(\mu ^d_i\), and the distance between \(S_i\) and the nucleus. In the “Supplementary Information”, we discuss the relationship between \(d_{1i}\) and the weights corresponding to the three-component Poisson mixture model for station \(S_i\), i.e. \(\hat{w}^d_i\), \(\hat{w}^m_i\), \(\hat{w}^p_i\).

Table 3 Results of linear regressions considering \(d_{1i}\) as the explanatory variable and \(\hat{\mu }_i^p\), \(\hat{\mu }_i^m\), \(\hat{\mu }_i^d\) from the 3-component Poisson mixture model as the response variables. Piccadilly Circus is considered to be the nucleus in London and City Hall in Seoul.
Figure 5

Relationship between the proximal, medial and distal means of the distribution of journeys terminating at each station, \(\hat{\mu }^p_i\), \(\hat{\mu }^m_i\) and \(\hat{\mu }^d_i\), and the distance \(d_{1i}\) between the nucleus \(S_1\) and the destination station \(S_i\).

The behaviour of the proximal and distal components is analogous to the \(K=2\) case. However, adding a third component allows us to capture the variability in the data in even more detail. Theoretically, Poisson mixture models place no limit on the number of components that can be included in their formulation; however, increasing \(K\) indefinitely is not always sensible, since it could lead to overfitting and hinder the interpretability of the outcomes. For this reason, we recommend keeping \(K\) to 2 or 3 as a good trade-off between capturing the detail in the data variability and keeping the components meaningful without overcomplicating the model.
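One common way to make this trade-off explicit is to compare models with different \(K\) using a penalised likelihood score. The sketch below is an illustration under our own assumptions, not the estimation procedure used in this paper: it fits Poisson mixtures with a plain expectation-maximisation (EM) loop and compares them with the Bayesian information criterion (BIC). The journey lengths are synthetic, drawn from a two-component mixture with means 5 and 20, and all function names are hypothetical.

```python
import math
import random


def poisson_logpmf(x, mu):
    """Log of the Poisson probability mass function at count x with mean mu."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)


def log_likelihood(data, weights, means):
    """Log-likelihood of the data under a Poisson mixture (log-sum-exp per point)."""
    total = 0.0
    for x in data:
        terms = [math.log(w) + poisson_logpmf(x, m) for w, m in zip(weights, means)]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


def fit_poisson_mixture(data, k, n_iter=200):
    """Fit a k-component Poisson mixture by EM; returns (weights, means, loglik)."""
    lo, hi = min(data), max(data)
    # Heuristic start: spread the initial means evenly over the observed range.
    means = [lo + (hi - lo) * (j + 0.5) / k + 1e-3 for j in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each journey.
        resp = []
        for x in data:
            terms = [math.log(w) + poisson_logpmf(x, m) for w, m in zip(weights, means)]
            top = max(terms)
            unnorm = [math.exp(t - top) for t in terms]
            norm = sum(unnorm)
            resp.append([u / norm for u in unnorm])
        # M-step: update weights and means from the responsibilities.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            weights[j] = max(n_j / len(data), 1e-12)
            weighted_sum = sum(r[j] * x for r, x in zip(resp, data))
            means[j] = max(weighted_sum / max(n_j, 1e-12), 1e-6)
    return weights, means, log_likelihood(data, weights, means)


def bic(loglik, k, n):
    """BIC with k means and k - 1 free weights as parameters (lower is better)."""
    return -2.0 * loglik + (2 * k - 1) * math.log(n)


def sample_poisson(mu, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


# Synthetic journey lengths for one station (NOT real tap-out data).
rng = random.Random(0)
journeys = [sample_poisson(5.0 if rng.random() < 0.6 else 20.0, rng) for _ in range(400)]

results = {}
for k in (1, 2, 3):
    w, m, ll = fit_poisson_mixture(journeys, k)
    results[k] = {"weights": w, "means": m, "bic": bic(ll, k, len(journeys))}

best_k = min(results, key=lambda k: results[k]["bic"])
print({k: round(v["bic"], 1) for k, v in results.items()}, "lowest BIC at K =", best_k)
```

BIC is only one possible criterion. Whatever score is used, the interpretability of the resulting components should be weighed alongside it, as argued above.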

Discussion and conclusions

Our work constitutes a novel approach to the study of urban structure. It makes use of a probabilistic modelling framework based on Poisson mixture models and new forms of data. Simply by analysing data related to the length of the journeys taken on the public transport system, the proposed probabilistic modelling framework enables us to disaggregate the journeys into several statistical subpopulations according to their destination station and their length, measured as network distances between stations. The methodology relies on the monocentric hypothesis, a null hypothesis stating that in a perfectly monocentric city, there is only one centre, which we call the nucleus, where all the socioeconomic activity is concentrated. Consequently, under the monocentric hypothesis, all the journeys terminating at any station other than the nucleus should have the nucleus as their origin station. Analysing deviations from the monocentric hypothesis allows us to infer the degree of polycentricity of a city by detecting the potential presence of centres other than the nucleus.

Applying the proposed general methodology to the two specific case studies of London and Seoul leads us to the following key findings. Firstly, the distribution of the length of journeys terminating at any destination station displays a high degree of variability in both London and Seoul. Secondly, by modelling the length of journeys terminating at each station in the public transport system with a single-component Poisson distribution, we observe that, especially in the case of Seoul, the most frequent journey length terminating at a given station is shorter than the distance between the nucleus and the station, perhaps due to the presence of closer, more local urban centres. Thirdly, by introducing the 2-component Poisson mixture model, we are able to classify each of the individual journeys to a station into what we call the proximal and distal components. The proximal component corresponds to journeys that take place at a local scale and the distal component involves journeys that take place at a global, city-wide scale. We see that regardless of the distance between the destination station and the nucleus, the most frequently observed journey length associated with the proximal component is around 5, as measured by network distance or number of stops away. These observations are particularly clear in the case of Seoul and are likely to be the consequence of the existence of other socioeconomic centres, closer to the destination station, to which passengers may prefer to travel to carry out some socioeconomic activities at a more local level. Conversely, the most frequently observed journey length associated with the distal component is larger than the distance from the nucleus to the destination station for destination stations that are close to the nucleus, showing that passengers who terminate their journey at one of these stations may be travelling not only from the nucleus, but also from other origin stations that are further away. As the distance from the nucleus to the destination station increases, the most frequently observed journey length associated with the distal component increases rapidly, indicating that the distal component captures long-distance, city-wide journeys from origin stations that are possibly close to the nucleus to stations that are in the peripheral regions of the city. Finally, increasing the number of components in the Poisson mixture model can help unpick details in the variability of the data; however, it can also make the model too complex and result in overfitting. After testing for other choices of nucleus (i.e. all the stations within a buffer distance from the initial choice of nucleus), we find that the observed patterns described above still hold.
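The classification of journeys into proximal and distal components mentioned above can be illustrated with a short sketch. It assumes that a journey of a given length is assigned to the component with the highest posterior probability under the fitted mixture, which is a standard choice but not necessarily the exact rule used in our analysis. The weights and means below are hypothetical values for a single station, and the function names are ours.

```python
import math


def component_posteriors(x, weights, means):
    """Posterior probability of each Poisson mixture component given journey length x."""
    log_terms = [
        math.log(w) + x * math.log(m) - m - math.lgamma(x + 1)
        for w, m in zip(weights, means)
    ]
    top = max(log_terms)  # subtract the maximum for numerical stability
    unnorm = [math.exp(t - top) for t in log_terms]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


def classify_journey(x, weights, means, labels):
    """Label of the component with the highest posterior probability for length x."""
    post = component_posteriors(x, weights, means)
    return labels[post.index(max(post))]


# Hypothetical fitted parameters for one destination station (NOT from the paper).
labels = ["proximal", "distal"]
weights = [0.6, 0.4]
means = [5.0, 20.0]

for length in (3, 10, 25):
    post = component_posteriors(length, weights, means)
    label = classify_journey(length, weights, means, labels)
    print(length, label, [round(p, 3) for p in post])
```

Under these illustrative parameters, short journeys are attributed to the proximal component and long ones to the distal component, with intermediate lengths receiving mixed posterior probabilities.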

Understanding urban structure from data related to the transportation system of cities has significant implications for urban policy. The methodology outlined in this paper provides a solution to the elusive problem of quantifying the degree of polycentricity for different urban areas. We argue that our proposed method resolves the issue of the fuzziness in the concept of polycentricity, hence allowing for better-defined terminology, which could help policy-makers convey ideas about urban structure in a more assertive way. In particular, the recent quest for centralising activities in more compact cities, where the emphasis has been upon inner and city centre living, could be well informed by this approach, in which the difficulty of moving towards more compact urban structures might be measured by the different parameters associated with the Poisson mixture distributions. In this sense, these distributions are able to reveal not only how the centricity of cities might change under different travel regimes but also how travel behaviour itself might be altered.

Additionally, our approach is also of relevance to those interested in the more theoretical, and even historical, aspects of urban areas, and it prompts questions for future research. As our findings for the specific cities analysed here indicate, London’s case aligns more with the scenario depicted by the monocentric hypothesis than Seoul’s. The construction of London’s transport network started at the end of the 19th century while the construction of Seoul’s started in 1971, approximately a century later, and in the course of all those years, several studies have reported a tendency for cities to become more and more polycentric. Assuming that the layout of the transport network and the passengers’ travelling behaviours are yet another manifestation of urban structure, the fact that our findings suggest that London is more monocentric than Seoul should not come as a surprise. But does this assumption hold in general? Has London’s early construction of a public transport network conditioned its urban structure and slowed its transition towards a more polycentric arrangement like Seoul’s? There is considerable scope for extending this type of analysis to the evolution of past cities, developing, where possible, ways in which mixture models like these can reveal how spatial behaviours can alter and have altered over decades. Public transport data will always be an issue for such historical analysis, but these methods can easily be extended to other trip distributions, such as the journey to work on different modes and over very different time intervals, where data are available.

This work is limited by the fact that the data sets used to illustrate the method are relatively small since they only contain data for two cities, a few weekdays and only some modes of transport. But we should reiterate here that the focus of this paper is on the proposed methodology and the analysis of data from London and Seoul is only used as a means to illustrate how to apply the method. The study is also limited in the sense that we are only able to consider the length of each journey as a network distance since the tap-out data set that was available for the analysis only shows, for each destination station, the number of journeys taken from a number of stations away. The fact that the origin station is unknown leaves us unable to consider, for example, other valuable information such as the Euclidean distance or the time duration between the start and end of each journey. Analogous limitations would apply to the tap-in data set. The main reason for these limitations is the difficulty in obtaining information about individual journeys due to data protection issues. Finally, our work only explores one temporal scale, but a deeper understanding of urban structure could be obtained by studying data at different times of the day as well as time periods of different lengths, and by analysing the temporal evolution of the data. However, as demonstrated above, the limited data is sufficient for the purpose of validating our methodology.