Demography and the emergence of universal patterns in urban systems

Urban areas exist in a wide variety of population sizes, from small towns to huge megacities. No proposed form for the statistical distribution of city sizes has received more attention than Zipf’s law, a Pareto distribution with power law exponent equal to one. However, this distribution is typically violated by empirical evidence for small and large cities. Moreover, no theory presently exists to derive city size distributions from fundamental demographic choices while also explaining consistent variations. Here we develop a comprehensive framework based on demography to show how the structure of migration flows between cities, together with the differential magnitude of their vital rates, determine a variety of city size distributions. This approach provides a powerful mathematical methodology for deriving Zipf’s law as well as other size distributions under specific conditions, and to resolve puzzles associated with their deviations in terms of concepts of choice, symmetry, information, and selection.

T he observation of approximate power law distributions of type frequencies is very common in complex systems [1][2][3][4][5] . The emergence of such distributions is known to be intimately connected to multiplicative growth processes mediated by the structure of complex networks of interaction 6,7 . Many different effective models have been proposed to derive particular forms of these distributions. They share common characteristics, such as invoking proportional growth (also known as Gibrat's Law) or preferential attachment and starting out at some mesoscopic scale, typically not derived from the fundamental population dynamics of the system [7][8][9][10][11][12][13][14][15][16][17][18] .
Among the many statistical patterns that characterize size distributions, none has received more attention than Zipf's law [17][18][19][20][21] . Specifically, Zipf's law applied to an urban system states that the expected population size of a city N is a function of its size rank, r, with an exponent z = 1, where N 0 is the size of the largest city (r = 1). In other words, the largest city is predicted to be twice the size of the second largest city (r = 2), three times larger than the third largest (r = 3), and so on. This is equivalent to a Pareto probability density for city sizes, P z (N) = c/N 1+z , with c a normalization constant. First observed by Auerbach for city sizes in 1913 22 , Zipf's law gained widespread attention through the work of the linguist George Kingsley Zipf, who established the rank-size rule first in the context of word frequencies in the 1930s 23 and then cities 24 . Zipf's law quantifies the concept of urban hierarchy, which, together with ideas of central place, location theory and other "laws" characterizing statistical regularities in urban systems, became the basis for the quantitative revolution in geography in the decades after World War II. Systematic analyses of urban systems from that time already suggested that the city size distribution is not universal, acquiring different forms as a result of distinct historical patterns of growth that do not have a relation to levels of economic performance or human development 25 . Like other power laws in complex systems, Zipf's law has also attracted much scrutiny and criticism over time 6,17 . It is hardly ever observed empirically in an unambiguous way, with data often showing systematic deviations at the two extremes of very large and especially of very small cities [26][27][28][29][30][31][32][33][34] . Besides such systematic variations, estimates of the exponent z take substantial ranges across time and for different urban systems, for example in China 27,35,36 , the United States 20,21,37 , and European countries 9,38 .
In our view, the empirical controversies about the form of the city size distribution and the interpretation of associated deviations can only be resolved through more fundamental approaches that start out with a system's basic population dynamics. It is also in the context of population dynamics that we can make sense of processes of selection and randomization that order and disorder complex systems. In this paper, we adopt the lens of evolutionary population dynamics to show the emergence and meaning of Zipf's distribution as a neutral law, signaling the absence of selection. It follows that associated deviations acquire the meaning of information, specifying people's preferences about where to live, constitute a family, and die. These results allow us to discuss and resolve a number of puzzles surrounding the interpretation of Zipf's law as a signal for urban system integration and associated implications for the coherence and symmetry of population states.

Results
Demographic dynamics of urban systems. We now derive the population size distribution of cities in the most general way, set by their demographic dynamics. The change of the population N i in city i during unit time Δt necessarily results from the balance of births, deaths, and migration as where N c is the total number of cities, υ i = b i − d i is the city's vital rate, the difference between its per capita average birth rate, b i , and the death rate, d i . The current J ij is the number of people moving from city i to city j over the time interval Δt, and vice-versa for J ji . For simplicity of notation we will work in units where Δt = 1. Eq.
(2) only accounts explicitly for migration between different specific cities, which can cross political borders. Migration to city i from non-specific places outside the system, for example from non-urban regions or other nations, can be incorporated into the vital rates υ i . Note that this model is very general, and naturally accommodates the inclusion of new cities over time when their sizes become nonzero, and indeed the disappearance of others, if their size vanishes.
To proceed we need to explore the dependence of the currents J ij on population size. We start by writing these currents as population rates, J ij = m ij N i (t), where m ij is the probability that someone in city i moves to city j over the time period. This allows us to write Eq. (2) in matrix form where NðtÞ ¼ ½N 1 ðtÞ; ; N i ðtÞ; ; N N c ðtÞ T is the vector of populations in each city at time t and the matrix A projects the population at time t − 1 to time t. This matrix is known in population dynamics as the environment 39 , with elements where m out i ¼ P N c k¼1;k≠i m ik is the total probability that a person migrates out of city i per unit time. We also define the population structure vector x = [x i ], with x i = N i /N T , which is the probability of finding a person in city i out of the total population N T . Note that P N c i¼1 x i ¼ 1, so that the structure vector has N c − 1 degrees of freedom (the magnitude of x is fixed).
Equation (3) can now be solved by repeated iteration City size distributions in different environments. A number of important ergodic theorems 40 in population dynamics tell us about the properties of the solutions under different conditions on the environments A(t). These results rely on some constraints on the properties of A; specifically, that the product of matrices in Eq. (5) is positive for sufficiently long times 39,41 . First, the weak ergodic theorem guarantees that for a dynamical sequence of environments, the difference between two different initial population structure vectors decays to zero over time. This means that there is typically an asymptotic city size "distribution", which is a function of environmental dynamics only, independent of initial conditions. When the environment is stochastic but otherwise time independent, the strong stochastic ergodic theorem states that the structure vector becomes a random variable whose probability distribution converges to a fixed stationary distribution, regardless of initial conditions. This is the sense in which most derivations of Zipf's law apply 17,33 . For stochastic environments, only probability distributions of structure vectors, not vectors themselves, can be predicted. Finally, in situations where the environment is time dependent and stochastic, the weak stochastic ergodic theorem states that the difference between the probability distributions for the structure vector resulting from any two initial populations, exposed to independent sample paths, decays to zero. Again, in cases where the environment is explicitly dynamic, besides being stochastic in a stationary sense, we cannot say much about the actual probability distribution of city sizes, only that the importance of initial conditions vanishes for sufficiently long times.
To get more intuition on these results we will now show some explicit solutions. The city size distribution can be calculated exactly when the environments A are arbitrarily complicated, but static. This is the result of the strong ergodic theorem, which says that in a constant environment A, any initial population vector converges to a fixed stable structure given by the leading eigenvector of the environment 39 . This is guaranteed to exist if A is a strongly connected aperiodic graph, meaning that any city can be reached from any other city in a finite and diverse number of intermediate steps along the non-zero migration flows between them. This is a reasonable assumption to make about an urban system, which may even serve as a definition. The leading eigenvector of A is the eigenvector centrality of the urban system. It describes the probabilistic location of a random walker over the graph of migration flows 42 . This is equal to the stationary solution of numerically integrating a network described by environment A. Figure 1 shows two numerical solutions with different initial conditions, but the same environment A. The cities in the two runs converge to similar sizes, as they experience a common environmental matrix with all entries set to be the same plus a little noise. The insets show how the population structure converges from both initial conditions to the one defined by the leading eigenvector e 0 . Because the structure vector is the probability of finding a person out of the total population in the urban system in a specific city, we use the Kullback-Leibler (KL) divergence 43 D KL ðPjP e 0 Þ ¼ P x P½xlog P½x=P e 0 ½x as a measure of the 'distance' between the initial distribution, P, and the final, P e 0 . The KL divergence is a foundational quantity in information theory. It measures the information gain in describing a probabilistic state using one distribution, P e 0 , versus another, P, in units of information (bits). Note that P ! P e 0 and thus D KL ðPjP e 0 Þ ! 0 over time, see insets in Fig. 1.
The convergence rate of the city size distribution to the leading eigenvector is given by the ratio of secondary eigenvalues to the dominant one. Explicitly, we can now write this solution as where Ae i = λ i e i , so that e i are the eigenvectors of the matrix A and λ i the corresponding eigenvalues, so that λ 0 > Re(λ 1 ) > Re (λ 2 ) > . . . , where Re(λ i ) is the real part of the eigenvalue. The projection coefficients c i are such that Nð0Þ ¼ P N c i¼1 c i e i , which can be written as c = E −1 N(0), where E is a matrix whose jth column vector is e j . The strong ergodic theorem states that the largest eigenvalue λ 0 is positive and real and that the associated eigenvector e 0 is also positive in terms of all its entries. This means that, over time, the dynamics of the urban system approaches that of the growth of its dominant eigenvector, because the projections on all other eigenvectors decay exponentially in relative terms, as and exemplified in the two insets in Fig. 1. The longest characteristic time is associated with the difference in magnitude between the two largest eigenvalues t Ã ¼ 1=ln λ 0 jλ 1 j . Consequently, for t ≫ t Ã we observe an extreme dimensional reduction from an initial condition characterized in general by N c degrees of freedom to a final one with just one, set by the environment's leading eigenvector. Using the values of migration flows in the US urban system over the last decade, we can compute the value of t Ã . It is rather long-of the order of several centuries-when measured using census data 44 and a little shorter using data on tax returns 45 .
These results allow us to consider very general classes of dynamics and initial conditions. We therefore conclude that in general, given a specific set of demographic conditions, the total population can grow exponentially with a common growth rate across cities. In addition, the relative population distribution at late times does not resemble anything like Zipf's law. As expected from the statements of the ergodic theorems, the population structure is set by the nature of the intercity flows (and vital rates), or equivalently by the environment A.
Stochastic demographic dynamics and symmetry breaking. The step towards a statistical solution for the city size distributionand obtaining Zipf's law in particular-requires the additional consideration of statistical fluctuations, which must arise in the vital and migration rates. Just as in models of statistical physics, the introduction of stochasticity implies not just fluctuating quantities but potentially also the restoration of broken symmetries 46,47 . In the context of urban systems, restoring broken symmetries means removing preferences for population growth a b   Fig. 1 Temporal evolution of the relative population structure of cities in the same environment for different initial conditions. Lines shows the trajectory of each city in terms of the fraction its population, N i to the total N T . a shows an initial situation where all cities start out with similar sizes, whereas in b they are initiated following Zipf's law (shown in log-scale). In both cases, the relative city size distribution eventually becomes stationary at long times (vertical red line) with the same population structure. This is given by the eigenvector e 0 , corresponding to the leading eigenvalue of the environment. The insets show how the system converges in both cases to this common structure set by the environment, because the Kullback-Leibler divergence D KL ðPjP e 0 Þ approaches zero at late times.
in specific cities over others, associated with imbalances in vital rates and migration flows. To make these expectations explicit we write the migration rates m ij as Eq. (8) splits the migration flows into two parts: s ij describes the symmetric flow between two cities (s ij = s ji ), and δ ij the antisymmetric part (δ ij = −δ ji ). Note that any correlations between N i and N j are preserved in these two quantities. Making the population size of the destination city explicit will allow to diagonalize Eq.
(2) and therefore to find self-consistent solutions for the structure vector. This form of the migration currents also makes contact with the gravity law of geography, which is typically written as In this form, the gravity law is symmetric, corresponding to Eq. (8) ij is a decaying function of distance d ij and γ is the distance exponent. Only the anti-symmetric part contributes to the dynamics of population size change, since we can write Eq. (2) as We can now appreciate the importance of the bi-linearity of the migration flows on both the origin and destination population as a means to diagonalize the demographic dynamics. We have achieved this however at the cost of introducing a slow non-linearity in each city's growth rate, via their dependence on the population structure vector x. We can establish the existence of a self-consistent solution x* such that for long times δ ij x i x j ¼ 0, by the conservation of the migration currents. Note that we did not assume any explicit form for the migration flows as a function of distance between places, so that our discussion includes many specific models, including different version of the gravity law.
This equation has a clear geometric meaning: The matrix δ = [δ ij ] is anti-symmetric and real, so that it behaves as an infinitesimal generator of rotations: R(x) = x + δx. We can write the self-consistent solution in terms of these rotation as where υ is still a function of x and υ = [υ i ]. Note that these rotations are associated with very small angles, and that the structure vector is subject to constraints, such as staying positive.
The first two terms of Eq. (12) ask for a general solution for x that is invariant under rotations. However, the last term, which introduces differences in relative growth rates for different cities, breaks this symmetry explicitly. This specificity of the relative growth rates leads to particular solutions that do not coincide with Zipf's law. Fig. 2a shows the example of a numerical solution in a non-linear, non-stochastic environment, defined by Eq. (8).
The final population structure in this environment is still well defined, as it reaches a steady state. However, compared to the time independent case (see Fig. 1), the final structure takes much longer to unfold and has a stronger urban hierarchy.
Symmetry restoration and the emergence of Zipf's law. The stage is now set to consider what it takes to counter the symmetry breaking resulting from the selective structure of the environment.
To do that, we must consider the statistics of the demographic rates. We start by rewriting Eq. (10) in terms of the dynamical equations for the structure vectors x as We now specify the relative growth rate ϵ i ¼ υ i À υ À δ i for each x i as a statistical quantity. It is clear by construction that its average over cities vanishes, ϵ i ¼ 0. The only remaining question is how ϵ i varies over time. There is strong empirical evidence that on the time scale of years to decades some cities can maintain larger growth rates than others. For example, presently in the US, most cities of the Southwest and South are growing fast, while other cities show slower growth or relative decay, such as many urban areas in the Midwest and Appalachia 48 . Moreover, it has been true in recent decades that mid-sized cities in the US have been growing faster than either the largest cities or small micropolitan areas, so that even population aggregated growth rates over certain size ranges differ. Going further back in time one could argue that the same cities of the Midwest were once booming as they became the manufacturing centers of the US, especially between the Civil War and the decades post WWII 49 . Similar arguments may be made for cities in other urban systems, such as in Europe 50,51 . Thus, it is possible that on very long time scales, T-on the order of a century or longer-the growth rates of cities equalize to very similar numbers and therefore the temporal average This requires that the temporal average of Eq. (11) remains zero, which means that the structure vector is both rotated and dilated 'randomly' in ways that over sufficiently long times lead to 〈ϵ i 〉 = 0, for each city. Numerical solutions of the demographic equations under these conditions show a population structure that closely resembles Zipf's law for long times, see Fig. 2b.
In the following, we show that Zipf's law is a probability density for this derived dynamics in the long run. To do this, let us characterize the variance in the temporal growth rate fluctuations as Over the same long time scales, statistical fluctuations in the growth rate may obey a limit theorem, so that they become approximately Gaussian, with variance, σ 2 i . Then, Eq. (13) becomes simpler, and acquires a direct correspondence to familiar stochastic growth processes. It can be written in analogy to geometric Brownian motion without drift as where W is a standard Brownian motion. We now see how demographic dynamics can be simplified through a chain of assumptions into a form consistent with typical stochastic processes leading to Zipf's law as proposed by Gabaix 20,52 . Equation (14) can be expressed as an equation for the probability density P ≡ P[x i , t|x i (0), 0] of observing x i at time t, having started with an initial condition where x i (0) was observed at time zero, The last term describes a probability current, This equation is exactly solvable, see Methods for technical details. To find the stationary solutions, we ask that the right hand side of Eq. (15) vanishes. The first of the two solutions corresponds to a vanishing current, independent of x, as proposed by Gibrat's law, we can drop the index so this becomes Zipf's law, PðNÞ ¼ P z ðN i Þ $ c N 2 . Note for example that if σ 2 i $ N Àα (α may be negative), which violates Gibrat's law in terms of fluctuations of the growth rate, then P~1/N 2−α . Thus, the city size dependence of growth rate fluctuations will change the exponent in Zipf's law and may destroy scale-invariance altogether if it is not a power law. The second solution applies for constant current, i. e. J x = J, which leads to P ¼ c 00 σ 2 x , where c″ is a normalization constant. This represents a flow of probability across the urban hierarchy. Both solutions are attractors of the stochastic dynamics that emerge for long times, t ≫ 1/σ 2 .
We find that in the limit where the relative difference in vital rates vanishes due to fluctuations, the structure vector becomes rotationally invariant on average and the angular symmetry of x is effectively restored. This leads to an effective symmetry of the relative city size distribution on the time scale t r = 1/2σ 2 , when growth rate fluctuations vanish. The time t r can be estimated using US data to be typically very long, on the range of many centuries or even millennia. Under these conditions all cities share statistically identical dynamics and can be interchanged, resulting in Zipf's law as the time average of population size distributions, see Fig. 3. In the same limit, the anti-symmetric components of intercity flows will average out and the gravity law will emerge in its conventional form. However, the observations of city sizes at each time are samples of this distribution and may reflect decade-long observable positive and negative preferences for certain cities. Figure 4 shows this effect for the US urban system using decennial census data from 1790 to 1990. For time periods of a few decades, the temporal average does not visibly converge to Zipf's law: The blue line in the inset of Fig. 4 depicts this effect. However, the cumulative temporal average of the size distributions starts to approximate Zipf's law after about five decades. The red line (inset) shows the KL divergence between the cumulative temporal average and Zipf's law, starting from 1790 to subsequent census.

Discussion
We have shown that the demographic dynamics of urban systems can result in different city size distribution and how Zipf's law, the gravity law and other coarse-grained statistical regularities in geography emerge approximately from these dynamics as averages when certain symmetries are restored over sufficiently long times.
Several of the properties of Zipf's law that have been controversial in the literature can be clarified from this perspective. First, Zipf's law is not a "unique signal for the integration of cities in an urban system" 54 ; it is merely one distribution of city sizes among many less symmetric ones that result from such integration. We have shown in this respect that integration of an urban system depends on strong intercity migration flows-but not necessarily the symmetric ones that might lead to Zipf's law. Intuitively, such flows mix people up between places creating a single common dynamics across all cities that is not associated with any particular kind of economic or political organization beyond enabling free intercity migration. However, asymmetries in the migration flows (and, thus, deviations from Zipf's law) might originate in systemic properties, but can have a manifold of other reasons-such as socio-economic opportunities, natural disasters, or demographic change-or originate from combinations of them. Second, in a recent article Cristelli et al. 54 make the important point that the properties of Zipf's law are systemic and cannot be understood simply on the basis of its mathematical form as a power law. They show that Zipf's distribution is coherent in that subsamples of a population distributed according to Zipf's law do not reflect the same statistics. They also emphasize that because the largest city in the system sets the overall properties of rank, the distribution makes no sense without it, something they call screening. These properties can be understood through the conservation of the current J x and its dynamical consequences. If we replace any other probability distribution, including a normalized subsample of Zipf's law, into the current we will obtain a nonzero drive to the entire system dynamics because P[x] is a field in x. This systemic response signals a lack of coherence and eventually restores P[x] back to Zipf's distribution. Similarly, a set of cities organized such that their relative populations has J x = 0 ensures that other cities cannot easily take their sizes and ranks, leading to screening as a result of the global conservation law of the probability current. Under these conditions there can only be churn in the relative positions of cities in the rank hierarchy. The characteristic extinction time for a city with initial rank r to fall below the lower boundary x < x m can be read from the solutions in Methods to be hΔtðrÞi $ 1 2σ 2 ln r max r ¼ t r ln r max r . The longest time, for the largest city to fall off the system, is hΔtðr ¼ 1Þit r ln N c , which is the product of the characteristic time for the reversal of city growth rates, t r , multiplied by ln N c ¼ ln r max , which is a measure of the depth of the urban hierarchy.
To better appreciate the commonly observed deviations of the largest and smallest cities in the distribution, consider that the current can be integrated over any finite range of x ∈ [x l , which is clearly satisfied by Zipf's distribution with z = 1. This shows how the statistical dynamics of population sizes depends on values at the extremes of x, the maximum, x M and minimum x m . While it may be natural to set x M = 1 with a small probability, it is more problematic to choose x m 's value, thereby defining the population of the smallest possible city in the sample. In particular, because x m = N m /N T > 0, we must specify how places with population N < N m relate to our distribution at the lower boundary. The vanishing current condition tells us that there should be a Zipfian "ghost" distribution of cities, such that P½x < x m ¼ x m x P½x m , with lots of small cities. Expanding on a point stressed by Saichev et al. 5 , this means that the boundary conditions, especially for small cities, are critical in shaping the resulting distribution. This issue betrays a failure of true scale-invariance hidden in Zipf's law. The solution of Eq. (14) in the absence of the vanishing current is a lognormal distribution 5 with average given by ln xðtÞ ¼ ln xð0Þ À ðσ 2 =2Þt (see Methods). This means that, in the absence of boundary conditions at x m , the distribution will become more and more peaked at smaller and smaller values of x over time. The conservation of the current stops this decay from happening, because it effectively injects some compensatory probability for small cities getting smaller, from other even smaller cities (not in the sample) getting larger. This logic requires an immense number of smaller and smaller cities. For example, if like in the US or China the largest city in the system has~20 million people, one needs 20 million cities with one person, 2 million with 10, 200,000 with 100 and so on, which is not observed. This inevitable deficit of small towns relative to Zipf's law will always lead to a loss of scale invariance for high ranks and the need to understand the nature of demographic processes at play in this regime on the basis of more fundamental processes, returning us back to the demographic dynamics of Eq. (2).
Zipf's law acquires a special status among other possible distributions in light of the property of neutrality, which emerges as the rotational symmetry of the dynamics is restored over long times. The term refers to the absence of selection in population dynamics, see Methods. As such, resulting distributions are maximally disordered given additional constraints (maximum entropy). This means that deviations from Zipf's law express selection of people towards certain places (resulting from temporary ϵ i ≠ 0) and should be measured in terms of information. In this respect, we can measure the surprise of an observed state of the system relative to the Zipfian expectation as SðxÞ ¼ log P½x P z ½x , and the total information (total surprise) in the choices underlying the actual observed distribution relative to the neutral situation (Zipf's law) in terms of the Kullback-Leibler divergence D KL ðPjP z Þ ¼ S, as we have done in Figs. 1, 3, and 4. Those measures additionally provide a way to compare structures of systems at different times and can reveal temporal trends, signaling basic structural changes in the underlying demographic processes. Fig. 4, for example, depicts a strong and long lasting  Fig. 4 The dynamics of the rank-size distribution for cities in the US between 1790 and 1990. Population structure distributions for the largest 100 cities in the US 53 are shown as dashed lines, while the average over all years is shown in red. The inset depicts the D KL (P|P z ) to Zipf's law of the population structure for each available year (blue) and of the cumulative average over all structure vectors up the specific year (red). We see that the temporal average steadily approaches Zipf's law as more years are added. After an initial phase, the average is always closer to Zipf's law than the distribution in any single year. Note that in recent decades, the population structure vector began consistently deviating from Zipf's law, leading also to a small divergence of the cumulative temporal averages. This effect is to a large extent the consequence of mid-sized and large metropolitan areas growing faster over this period than the largest cities in the US urban system.
increasing deviation from Zipf's law for the US. It might be a signal that the US is in a transitory state with strong preferences for certain regions since the 1940s, such as mid-sized cities in Texas and the Southwest. However, such statements for specific urban systems need further context-specific investigation, which is beyond the scope of this article. In summary, starting with the most general demographic equations, we can derive many instances of integrated urban systems with city size distributions that differ substantially from Zipf's law. We can understand the process by which universal patterns in geography emerge as a sequence of situations that restore the symmetry of the demographic dynamics and ultimately rely on averaging stochastic behavior of vital and migration rates over sufficiently long times. Seeing Zipf's law and other laws of geography in this light helps us appreciate the information content of associated deviations in real urban systems, and trace them back to specific choices and preferences associated with people's agency in terms of births, deaths, and migration. Formally, the approach developed here for urban systems relies only on the analysis of the transition probability between types against a general background of multiplicative growth dynamics. As such, it is very general and readily applies to other situations in complex systems where rank-size rules are also observed approximately.

Methods
Probability solution for geometric random growth with boundary conditions. The Fokker-Planck equation for random geometric growth without drift is where P = P[x(t), t|x(0), t 0 ] is the conditional probability of observing state x of the random variable at time t, given the initial state x(t 0 ) at time t 0 . For simplicity of notation, we have dropped the i indices in x and σ and write x(t 0 ) as x 0 . Equation (17) has some similarities with the diffusion equation in physics and can analogously be solved exactly. First, it is useful to change variables so as to eliminate the non-linear term x 2 . We set y ¼ ln x x 0 and τ ¼ σ 2 2 ðt À t 0 Þ. By changing the variables in Eq. (17) we find This equation is now linear and can be solved in two ways. The first way is using factorization and solve it as a heat equation and The second way is to solve it directly via a Fourier transform, so that which leads to This equation can be solved via a separation of variables, P[k, τ] = f(k)T(τ), which leads to an eigenvalue problem: Solving for k we obtain: with where P[y, 0] = f(k)T(0). In particular, there are two stationary solutions, for w = 0, with k = k 0 = 2i and k = k 1 = i. Substituting k 0 , we see that the solution corresponds to P $ e À2y ¼ 1 x 2 , which is Zipf's distribution. The other solution corresponds to the existence of a constant probability current up or down the urban hierarchy associated with different boundary conditions, see main text. In order to obtain Zipf's law as the stationary distribution for long times, we must add constraints to the geometric random growth dynamics. To see this more explicitly, we consider the full dynamical solution, which can be written in terms of y as, To obtain Zipf over the long run, three conditions are needed: First, the integration constant α 2 has to be zero when the probability current is set to zero. In terms of y, this reads as JðyÞ ¼ e À2y d dy σ 2 e 2y P½y; τ ¼ 0: Second, the constant α 1 can now be fixed by normalization of Zipf's distribution as a probability: , where x M and x m are the upper and lower boundaries of city sizes, respectively. A natural choice for the upper boundary is x M = 1, whereas the choice for the lower boundary is not obvious, see main text. Third, we can now set the boundary condition on the time varying solutions, which are now gðy; τÞ ¼ P½y; τ À α 1 e À2y ð29Þ and must be taken to vanish at the boundaries in y. The integral over the range of city sizes needs to be zero to preserve the probability normalization. For example, by asking that these solutions are real and vanish at the boundaries, meaning g(y = y m , τ) = g(y = y M ) = 0, we get g½y; τ ¼ X n a n ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 y M À y m s sin k n ½ðy À y m Þ À 3τe Àðk 2 n À2Þτ ; Even though these functions now obey the boundary conditions, the temporal structure is similar to the case with no boundary conditions and the decay of the initial amplitudes occurs on a time scale set by τ, which is t À t 0 ¼ 1 2σ 2 and can be very long for small volatilities, σ.
Lognormal solution in the absence of conserved current. We can find the general solution in Eq. (25) with the initial condition P[y, 0] = δ(y), which corresponds to the case N(t = 0) = N T . Then P[k, 0] = 1 and the solution for all times is Returning to our original variables this becomes which is a lognormal with log-mean hln xi ¼ ln x 0 À σ 2 2 ðt À t 0 Þ, and log-variance hðln x À hln xiÞ 2 i ¼ σ 2 ðt À t 0 Þ. As a result for late times, in the absence of a boundary condition that sets the current J x , the city size distribution becomes peaked around smaller and smaller sizes, and become broader and broader.
We can define an "extinction time" 〈Δt(r)〉 as the expected time interval for a city of initial rank k to fall through the lower boundary condition, at x = x m . This is defined implicitly through the conditional probability P[x m , t; x 0 , t 0 ] = p~1, where x m is the minimum size, which close to Zipf corresponds to maximal rank r max $ 1=x m . Observation of the exact solutions above tells us that the leading time is hΔtðrÞi $ 1 2σ 2 ln x x m ¼ 1 2σ 2 ln r max r . There are subleading terms that depend on the exact initial condition. The longest extinction time belongs naturally to the largest city (with initial r = 1): it is hΔtðr ¼ 1Þi $ 1 2σ 2 ln r max , which depends on how many cities there are in the urban system because r max = N c .
Neutrality of population dynamics and Zipf's law. In evolutionary population dynamics, a change in the relative probability of types (in the absence of errors) is attributed to selection. The situation when selection is absent and only statistical fluctuations drive the dynamics is known as neutral dynamics.
We noted in the main text that the structure vector x i ðtÞ ¼ N i ðtÞ N T ðtÞ , i = 1, . . . , N c is the probability of finding a person in city i at time t out of a total population NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18205-1 ARTICLE N T (t), across all cities. In evolutionary population dynamics we write the evolution of this probability as The quantity w i (t) is the fitness of state x i , because if w i (t) > 1 this state becomes more common in the updated population and vice versa when w i (t) < 1. Note that by normalization of the updated frequency the population average P i x i ðt þ 1Þ ¼ 1 ¼ P i w i ðtÞx i ðtÞ ¼ wðtÞ. Comparing to Eq. (13), this establishes that w i = 1 + ϵ i . Positive (negative) selection correspond to the situation when w i > 1 (w i < 1). Neutrality corresponds to w i = 1 → ϵ i = 0, which we showed to be necessary to obtain Zipf's distribution. Consequently, Zipf's law is a neutral distribution.
Note that Zipf's law is not the only neutral distribution for multiplicative dynamics, but becomes unique when we impose the additional condition, J x = 0.
Additionally ln w i has a meaning as information. To see this write, where P(A) is the probability of a particular environment and p(x i , A) is the joint probability of the population structure and of states of the environment. The idea is that the updated probability is the structure vector given (conditional on) the influence of the environment, as we wrote in Eq. (13), so that x i (t + 1) = P(x i |A(t)). Thus, log w i once averaged over x and A is the mutual information between the population structure and the environment. When the evolution is neutral P(x i , A) = x i P(A), the two variables are statistically independent and there is no information from the environment being encoded in the population structure.
Since the environment, in our case, is the space of preferences in vital and migration rates, there is no structure of these preferences encoded in the population structure when we observe Zipf's law. Only the deviations from it, when ϵ i ≠ 0, can in this sense give us information. Figs. 1-3 where obtained by iterating the respective environments, A, starting with several different initial population states, N(0), as expressed by Eqs. (3) and (5). We assumed that all environments are strongly connected graphs; the migration probability from city i to city j is non-zero, or A ij > 0. These parameterizations reflect features of real world urban systems such as the average annual fraction of population that moves between metropolitan areas in the US in the last few decades, which is about 1.8%, according to US Census Bureau 44 and tax returns reported by the Internal Revenue Service 45 . The data from both sources shows that the migration rates δ ij follow a lognormal distribution. We implemented the stochastic environments to reflect this statistical distribution. The introduction of stochasticity in Figs. 2 and 3 requires that we set boundary conditions at large and small x as discussed in the main text. The upper bound is less critical and a natural choice is to allow the whole population to concentrate in one city, or x M = 1. This state has very low entropy, so that it is extremely rare for it to occur by chance. In practice, it is never observed in numerical solutions even when stochastic fluctuations are strong. On the other hand, the boundary condition for small cities is violated all the time, as shown in Fig. 2. Varying the value of x m shows that the best value for the lower bound is of the order of the size of the smallest city according to Zipf's law (N(r = N c ) = N 0 /N c ), resulting in:

Time integration. Numerical solutions shown in
When a city's time evolution violates this lower constraint, our implementation resets the population size of the city back to its former size, typically just above x m , and adapts the other cities so that This is achieved by reducing the x i uniformly among all other cities without violating the lower bound during this process. This implementation of the boundary condition for small cities is necessary to obtain Zipf's law. It mimics the probabilistic effects of a ghost population of very small cities, as described in the main text.
Data. Data was obtained from two main sources, the US Census Bureau (USCB) and the US Internal Revenue Service (IRS). USCB provides a data set for the complete population in the US on a decennial basis. We use CPS Historical Migration/Geographic Mobility Tables 44 as the main empirical basis in this paper. IRS data is released annually 45 . Migration rates are only implicitly given in this dataset. It provides locations of tax filings every year and corresponding (different) locations the year before. IRS data does not capture the whole population, since only the portion files tax returns (perhaps 85%). It nevertheless provides valuable insights into the migration patterns within the US. In both cases data are provided at the county level. We aggregate the data to the level of Metropolitan Areas (MSA). MSAs are functional cities, defined as clusters of contiguous counties that have strong socio-economic ties to an urban core, measured via commuting fluxes. For this reason, MSAs are definition of urban areas as integrated labor markets. A crosswalk from counties to MSAs is provided by the National Bureau of Economic Research (NBER) 55 .
This data allows us to build all entries of the environmental matrices A and their migration flows. The net vital rates are given implicitly in both data sets via the total population change for each county in every year. We assume that the parts of the growth rates that can not be explained by national migration, stem from the balance of births, deaths, and international migration. All references to data are based on the matrices constructed this way on the basis of USCB and IRS data, except for Fig. 4, which is based on another data set from USCB 53 .

Data availability
The data that support the findings in this study are available from the following sources: The main empirical basis on migration and geographic mobility data are available from USCB: https://www.census.gov/data/tables/time-series/demo/geographic-mobility/ historic.html.