The United Nations recently reported that there are about one billion migrants worldwide, of which one quarter are international migrants1. The size of the migration phenomenon and the speed by which it increases have turned migration into a challenge that is at the top of the political agenda in the European Union, the United States and in many other countries across the world. One reason why migration has become a major political priority is that it is a catalyst for large-scale social, economic and demographic changes2 capable of producing opportunities but also turmoil and friction.

Integration and social cohesion are keywords when addressing many of the challenges posed by increasing migration. Some of the problems related to such processes are carefully analyzed in several studies (for instance3 and the references therein). The European Union in particular has identified a list of common basic principles to make integration work4,5 based on employment for immigrant, frequent interactions between immigrants and natives, etc.

Our work stems from the simple observation that very little is known about the mechanisms that bring about integration. For example, elementary questions like how integration responds to an increase in immigration density or to what extent the intensity of interaction modifies the level of integration still beg coherent empirical and theoretical answers. Our study is inspired by the realization that precise answers to those questions are of paramount importance to formulating social policies able to promote integration.

The research reported here addresses these issues from a quantitative point of view without making use of the classical historic series approach (see the Supplementary Material for discussion). Instead we study the problem of integration from the hard-science point of view by relying on methods and techniques stemmed from statistical mechanics, the branch of theoretical physics devised to explain thermodynamic laws as emerging average behavior for systems composed of a high number of microscopic interacting elements. The application of ideas and methods from statistical mechanics to fields other than physics has emerged in several contexts over the past decades (see6,7,8,9,10 and11). Their development in quantitative sociology is ongoing12,13,14,15,16,17 and they have recently been used to study immigration phenomena18,19.

Following the available data, our research focuses on classical quantifiers of integration such as the fraction of temporary and permanent labor contracts given to immigrants, the fraction of marriages with spouses of mixed origin (native and immigrant) and the fraction of newborns with parents of mixed origin. We aim at a predictive theory by which the magnitude of the above-mentioned indicators can be expressed as a function of the density of immigrants , i.e. the ratio between the number of immigrants Nimm and the total population N = Nimm + Nnat, where Nnat is the number of natives

Since these quantifiers are the sum of many random variables divided by their total number and we are interested in studying their dependence on , what we seek is first the empirical law from real data and then the theoretical probability law that, in the limit of large numbers20, entails the observed collective behavior.

Within this work we are interested on the average values of the quantifiers, at the national scale. The quantifier fluctuations are shortly analysed only from the heuristic point of view (see the Supplementary Material). The problem of establishing if our mathematical theory predicts correctly their structure, as well as the study of the data at different scales, will be subjects of future works.

To introduce our approach, let us step back to a well-understood phenomenon of collective particle behavior like those considered in statistical physics. Let say we are interested in the basic matter of discovering whether those particles behave independently or not. In many instances, in particular for ferromagnetic particles, the knowledge of their collective observables, like the magnetization, is not enough to answer the question since we could observe the same magnetization for systems with or without interaction. Nevertheless, being able to control some parameters like the temperature of the system or the strength of the imitation coefficient (coupling constant) among particles or the external magnetic field, makes the problem easily solvable by the classical theory of magnetism21,22. When the interaction is negligible the response of the system to a small solicitation is well approximated by a linear function. If instead the imitation is dominating, the system would remain insensitive to solicitations up to a critical threshold and start responding very quickly once that threshold has been exceeded.

We want to point out that the same problem has emerged in a sociological context too, as was lucidly formulated by Max Weber23: Social action is not identical with the similar actions of many persons… Thus, if at the beginning of a shower a number of people on the street put up their umbrellas at the same time, this would not ordinarily be a case of action mutually oriented to that of each other, but rather of all reacting in the same way to the like need of protection from the rain.

Individuals, indeed, do behave sometimes independently from each other. For instance the McFadden's Nobel prize rewarded solution of the Bay Area Rapid Transit problem24 is based on a non-interacting system of individuals as pointed out in25 and26.

Imitative or, more generally, correlated behavior is nevertheless even more frequent. Classical examples are when the actions of others are imitated because they are fashionable, traditional, or lend social distinction23. Others have pointed out that individual decisions are made according to imitation even in situations that were previously not considered15,27. For example, sociologists have argued that imitation, or adopting the behavior of others, is a frequent and highly rational strategy when the consequences, social as well as personal, of one's actions are difficult to assess28. Therefore ex ante both types of behavior are candidates when seeking to explain how integration comes about.

Our research shows how it is possible to distinguish, using quantitative methods, whether the value of the integration quantifier follows from people acting according to some individual preferences independently of other people, as in Max Weber's rainfall example, or whether it follows as a result of social interaction with others and, of course, all the possibilities in between. The two extreme cases are described in statistical mechanics either as free theory–independent particles, perfect gas29 – or interacting theory with possible phase transitions6.

While it is easy to envision formal similarities between particle behavior and human behavior, the difficulty in our context is that the analogy needs to be strengthened beyond the formal level. In the social sciences, in fact, there is no natural notion of system temperature or, in other terms, it is not clear how to measure a cost function for social actions nor what units to use to perform the measurement. Moreover, there is no simple way to tune the strength of imitation between people or that of their individual tendency to decide toward some choice. Our proposal to overcome the obstacle is to consider, as control parameter of the Immigrants-Natives system, the quantity that tunes the total number of available cross-links couplings among the two populations:


The main results of our work are summarized in Fig. 1 where the average values of the quantifiers are shown at increasing densities (notice that, in a small neighborhood of zero the parameters Γ and coincide): while the quantifiers measuring labor market integration (green and yellow dots in figure) exhibit linear growth in Γ, the quantifiers capturing the intensity in mixed-marriages and the number of newborns with mixed parents (blue and red dots) display a non-linear behavior. The distinctive feature of this non-linear growth is its quick take off which progressively decreases at increasing densities. Hence, if we apply a linear theory when interpreting the whole database (grey line), we would underestimate the level of integration when the immigrant density is low and overestimate it when it is high. Moreover, an integration forecast based on a linear theory (black line) would lead to predictions that are twice as large as the observed values. A finely tuned fit of the blue and red points shows a functional shape proportional to starting very close to the origin of the axes (blue and red curves).

Figure 1
figure 1

Dots are average quantities versus Γ.

Left upper panel: quantifier Jp (green dots), the fraction of permanent labor contracts given to immigrants on the total of labor contracts, with the best linear fit (free fit) aΓ (a = 1.52 ± 0.05, goodness of fit R2 = 0.985). Right upper panel: quantifier Jt (yellow dots), fraction of temporary contracts given to immigrants, with the best linear fit (free fit) aΓ (a = 1.81 ± 0.09, with a goodness of fit R2 = 0.963). Left lower panel: quantifier Mm (blue dots), fraction of mixed marriages, with the best square root fit (blue curve) (c = 0.53 ± 0.02, goodness of fit R2 = 0.992), the best linear free fit (grey line) aΓ (a = 1.18 ± 0.07, with a goodness of fit R2 = 0.855) and the best linear extrapolation fit (black line) bΓ (b = 1.92 ± 0.07, for 0 < Γ ≤ 0.035, goodness of fit R2 = 0.964). Right lower panel: the quantifier Bm (red dots) fraction of newborns with mixed parents, with the best square root fit (red curve) (c = 0.28 ± 0.01, goodness of fit R2 = 0.984), the best linear free fit (grey line) aΓ (a = 0.64 ± 0.05, goodness of fit R2 = 0.789) and the best linear extrapolation fit (black line) bΓ (b = 1.04 ± 0.05, for 0 < Γ ≤ 0.04, goodness of fit R2 = 0.922).

The data analysis calls for a theory able to describe all the observed data on a unified mathematical framework: we proceed by building such theory by first observing that a labor contract, a marriage, a child birth are coupling relations among humans. In a two-group system such as a society composed of immigrants and natives, there can be in-group or cross-group couplings. Consequently, the choice between, say, marrying or hiring an immigrant over a native is dichotomous. To this end the Discrete Choice theory proposed by McFadden24 is a natural candidate to describe the frequency of cross-group couplings in large populations. McFadden's theory contains a crucial assumption of mutual independence between the involved random variables25. When applied to the case we are studying, that theory predicts a linear growth in Γ in a suitable interval. Our findings indicate first how well McFadden's theory works in assessing the level of integration in the Spanish labor market and suggest that the choices between assigning the job to a native or to an immigrant are made in a mutually independent fashion, case by case, no matter how other actors choose, with little or no peer-to-peer mechanism at all. Our results show, however, that the choice of marrying or having a child with a native or an immigrant partner is not well described by the classical discrete choice theory.

We argue that this mismatch arises because the latter decisions are no longer taken independently. Instead the action of marrying an immigrant or having a child with an immigrant is a decision that is contingent of how others in the environment have acted or not acted in this context before the decision at hand. As a first step and coherently with the concept of universality in statistical physics6,29, we do not consider the causes of such social influence (like observing and imitating already successful strategies, or imitate the behavior of people we trust, or comply with cultural norms) but we simply refer to all social actions that may cause interdependence or imitation in decision making.

Theories that relax the assumption of independence and cater to imitation, introduced in15 and developed in25,30,31, predict a square root behavior of the probability of cross-group couplings as observed in the marriage and child birth data. The methodological part of our work identifies a mathematical model capable of dealing with both situations for different values of the parameters. The result is a generalization of the monomer-dimer model32 with the addition of an imitative interacting social network component, with random topology in agreement with the small world-scenario33 (see the section Supplementary Material). The model we propose reduces to the classical discrete choice theory with linear growth in situations when imitation is negligible and to the square root behavior when imitation is prevailing.

The paper is organized as follows. The first section is Data and Integration Quantifiers where we introduce the data and define the quantities to be studied. In the second section Results and Discussion we present the findings and their theoretical interpretation, as well as some possible outlooks. Finally, in the Supplementary Material, articulated in Data Analysis and Elaboration and Statistical Mechanics Methodology, we outline details of our experimental and mathematical framework.

Data and Integration Quantifiers

We use immigration data from Spain on the time interval 1999 to 2010 because it corresponds to the period in which Spain received most of its immigrant population. The smallest geographical unit for which data are available is for the administrative unit, Municipality. Hence, in this work Location equals Municipality.

Data on local immigrant densities are compiled as follows. We use the size of the immigrant population and the native population in each municipality as reported in the 2001 Census as our baseline. Thereafter, we estimate the local immigrant densities for different points in time (quarterly) between 1999 and 2010. The analysis is based on the information contained in the Statistics over residential variation in Spanish municipalities, the so called Estadistica de variaciones residenciales (EVR): for each municipality, the data contain information about (internal) migration to and from other Spanish municipalities, as well as all international migration events and statistics on vital events (births and deaths), as elaborated by Spain's National Statistical Agency (INE). A unique feature of the Spanish data is that they also include so-called undocumented immigrants, that is, immigrants who lack a residence permit34. Undocumented immigrants are usually not included in official statistical sources. However, their share of the immigrant population is often significant and excluding them would underestimate the true size of the immigrant population and, most likely, change the nature of the studied phenomena. We point out that the data we analyzed didn't include detailed topological features on location to allow us to address issues like municipality segregation and mix of migrant groups.

Data on marriages and births are drawn from the local offices of Vital Records and Statistics across Spain (Registro Civil) and have been compounded by the INE. Data on marriages contain information about the time of the marriage as well as the place of birth, nationality, municipality of residence and the like, of all the spouses entering into marriage in Spain. By our definition, a mixed marriage occurs when a Spanish-born (native) person marries a person born in a foreign country. Similarly, data on births contain information about the place of birth, nationality, municipality of residence, among other things, of all the newborns parents. For the same reasons as with mixed marriages, we consider all newborns with one native and one foreign born parent to be newborns with parents of mixed origin. Information is included on mixed marriages and newborns with parents of mixed origin where the foreign-born spouse or parent is an undocumented migrant. We focus our analysis on birth and marriage events that occurred during the period 1999 to 2008. However, data on density, marriages and births are subject to minor data protection restrictions. An individual residence municipality is only disclosed if its population is larger than 10, 000. For this reason, out of approximately 8, 000 municipalities in Spain, our analysis focuses on only 735. Still, 85% of Spanish immigrants reside in the included municipalities.

Data on labor contracts come from Spain's Continuous Sample of Employment Histories (the so called Muestra Continua de Vidas Laborales or MCVL). It is an administrative data set with longitudinal information for a 4% non-stratified random sample of the population who are affiliated with Spain's Social Security. Sampling is conducted on a yearly basis. We use data from the waves 2005 to 2010. The inclusion of an individual in the sample is determined by a sequence in the individual's social security number that does not vary across sample waves. This means that individuals are maintained across samples. New affiliates with a social security number matching the predetermined sequence are added in each new wave. The data contain information on contractual conditions such as whether the individuals have a temporary or indefinite labor contract, as well as the contracts start and stop times. Residential data at the level of municipality and information about place of birth are also available. In contrast to the data on densities, marriages and births, for these data the residence municipality is only disclosed if the population is larger than 40, 000.

For those unfamiliar with the Spanish immigration context, the following brief information may be useful. In 1999, Spain received fewer than 50, 000 new documented and undocumented immigrants. Since then, annual immigration levels have increased dramatically, reaching a peak in 2006 and 2007, with inflows exceeding 800, 000 (see light gray bars in Fig. 2). Spain's documented and undocumented foreign born population has risen from little more than 1 million to over 6.5 millions in the analyzed period (see solid line in Fig. 2). Its share of the total population has risen from less than 3% to over 13% in the same period. Currently there are immigrants from almost all nations in Spain. However, some 20 immigrant origins account for approximately 80% of Spain's total immigrant population. Immigrants from Romania form the largest minority in Spain (767, 000 at the end of 2008), followed by immigrants from Morocco (737, 000 at the end of 2008) and Ecuador (479, 000 at the end of 2008). Europe and South America together account for over 70% of Spain's total immigrant population.

Figure 2
figure 2

International immigration and stock of foreign born population in Spain in the decade 1999–2009.

The inset highlights migrant income from specific (main) countries of origin.

Operatively, we derive two datasets based on the information described above. One contains data on marriages and newborns and the other on labor market affiliation. Both datasets contain spatial and temporal information, such as the municipal code, quarter, year and the immigration density in the municipalities across time. The data on labor contracts consist of 3, 553 entries over the period 2005–2010. The data on marriages and newborns consist of 27, 144 entries spanning the period 1999–2008. For the overlapping period (16 quarters of the 2005–2008 window), the values of 's match very well, which can be seen as a good test of the quality of the second sample, since the first dataset is not a sample.

The quantifiers we study as a function of are defined as (we use the symbol # to mean “number”):

We notice that they can be studied equivalently in Γ by the quadratic map of the interval into 0 ≤ Γ ≤ 1/4.

Results and Discussion

The four classical integration quantifiers just introduced describe some type of social coupling like the one between the employer and the employee in the job market, or between individuals in a marriage or parenthood. The reader can find in the Supplementary Material all the details to obtain the average points from the raw data, as well as the fitting procedure used to obtain the functional relation on Γ from the averaged data as shown in Fig. 1. For completeness we also show the quantifiers behavior versus the original parameter in Fig. 3.

Figure 3
figure 3

Upper-left panel, : Data are represented as spots (green) in each bin.

The black line is the best fit of the free theory outcome and yields the best value cF = 1.52 ± 0.05 with a coefficient of determination R2 = 0.985. Upper-right panel: : Data are represented as spots (yellow) in each bin. The black line is the fit of the free theory outcome with and yields the best value cF = 1.81 ± 0.09 with a coefficient of determination R2 = 0.963. Lower-left panel, : Data are represented as spots (blue) in each bin. The black line is the best fit of the free theory outcome and yields the best value cF = 1.18 ± 0.07 with a coefficient of determination R2 = 0.855. The blue line is the fit of the interacting theory outcome with with cI = 0.53 ± 0.02, and with a coefficient of determination R2 = 0.992. Lower-right panel, : Data are represented as spots (red) in each bin. The black line is the fit of the free theory outcome with and yields the best value cF = 0.64 ± 0.05 with a coefficient of determination R2 = 0.789. The red line is the fit of the interacting theory outcome with with cI = 0.28 ± 0.01, and with a coefficient of determination R2 = 0.984.

We now provide some statistical physics theoretical bases to explain the observed functional relations. The mathematical details of their derivation are given in the Supplementary Material.

Let us first discuss the job market. We start by observing that the number of employers is proportional to the number of natives. This can be seen by computing, in the period of time we study, the correlation between the size of the native population and the total number of employers from the data made available online by the Spanish Bank “La Caixa” (Anuario Economico de España), which turns out to be 0.98. On the other hand, the number of immigrant employees is proportional to . An elementary combinatorial computation (as well as a probabilistic one based on tossing Bernoulli coins20) predicts a frequency of jobs given to immigrants of the following type

where cF is some proportionality constant. This formula, which is linear in Γ(), provides a remarkably good fit to the employment data for both permanent and temporary jobs, as one can see from the upper panels of Figures 1 and 3. By giving each job position a two-valued random variable, since the job is given either to an immigrant or to a resident, one sees that the main feature of models predicting a similar behavior in is the assumption of the mutual independence of those random variables. In other words, on the macroscopic scale we are working on (i.e. the whole country), the likelihood of giving a job to an immigrant is indifferent to the fact that another job has been given to an immigrant or not. Such models within socioeconomic sciences, are all versions of the discrete choice theory24,25 originally devised to predict the use of public transportation but nowadays also used in other contexts such as occupations, residency locations, etcetera. That method, by suitably measuring the parameters of the theory (by polls or historical series), can yield quite accurate predictions as it did in the Bay Area Rapid Transit problem24. They are usually parameterized according to the logit (or multi-logit) probability distribution. Their predictive success is based on the fulfillment of the independence hypothesis. Potential threats for its validity are then peer-to-peer effects, belief propagation effects and in general all the situations where individual rational choice is perceived as difficult and people instead resort to imitation or anti-imitation of others. Within statistical mechanics, a theory of independent random variables is also called a free theory as opposed to an interacting one. The discrete choice theory is quite well-suited for policy-making for several reasons. First, it is based on empirical data and second, the utility function it is built on allows researchers to test concrete policy scenarios by varying the free parameters.

Coming to the other two quantifiers, the mixed marriages and newborns, one can see that the same type of free theory that provided an excellent fit for the data in the job market case is much less appropriate here (see Fig. 1). Indeed, the data show an anomalously high growth at small values of , largely underestimated by the free theory, followed by a crossover, after which the free theory overestimates the quantifier. The free theory makes a mistake by up to 30% of the quantifier.

The first thing to check is whether the different nature of the random variables can be responsible for the difference in social behavior. Social couplings, for instance in marriage, do fulfill the interaction rule of monogamy. That is, each individual is married only to one other individual, unlike employers who may have many employee or an employee who may hold more than one job. Similarly, the children of mixed couples do not on average exceed small units. All these are indeed forms of interaction, but a straightforward computation shows that, if no other interactions are also present, the same type of behavior of equation (8) still emerges for the probability of mixed marriages and newborns as discussed in the Statistical mechanics methodology section of the Supplementary Material.

From the lower panels of Figures 1 and 3, a different type of curve provides an impressively good fit, which is the square root of the main quantity i.e.

for a suitable proportionality constant cI. It is a well known fact that the square-root curve carries the fingerprint of the mean-field ferromagnetic theory of statistical mechanics21,22 that describes the imitative behavior of particles in dichotomic states. As explained in the Supplementary Material a vanishing Γc is compatible with a class of diluted networks that we consider in the proposed mathematical model. The match between theory and phenomenological data, beside being visually manifest, is quantitatively represented by the coefficients of determination reported in the captions of Fig. 3. We point out moreover that the interacting square-root behavior is clearly emerging from the data even without detailed information on the topology of the interaction network and proves the robustness of our findings. Neglecting the segregation or sub-community effects, possibly revealed by detailed topological information, may in fact lead, at worst, to underestimate the interaction effects.

A natural extension of our work would be to analyze the fluctuations around the average values. This has been done in the Supplementary Material only at a simple descriptive level. The main point still to be addressed is a quantitative match between fluctuation data and mathematical theory beyond their order of magnitude which is, roughly speaking, compatible with the noise generated by finite volume effects and the randomness of the social network structure. Moreover one should seek for more detailed data carrying also topological information in order to address segregation effects, as well as an increased detail beyond the bi-partite scheme immigrant-native to allow subsets of different nationalities and multipartite modeling. A natural test for our model is to analyse data from different countries. We believe that although the presence or the lack of the interacting regime may be different in different countries due to differences in cultures and regulations, the model proposed is general enough to adapt to different cases. In particular, the square root starting from the origin of the axes is somehow the expected emergence behaviour for the interacting cases. We plan to return on all those matters in future works.

To summarize our work we have analyzed a specific dataset of integration quantifiers in Spain and identified the empirical laws at growing immigration densities. Focusing on their average values on the national scale we found two types of growth and we have provided a simple theoretical framework for their interpretation. Our results could improve our ability to target integration policies since they provide an operative method to distinguish whether a macro phenomenon such as immigrant integration is the product of social action, as in the case of intermarriages and newborns with mixed parents, or the product of the common action of many people23, as in the labor market case. Our study shows the potential gain in introducing new families of mathematical models based on a statistical mechanics extension of discrete choice theory, since the latter offers a set of formal tools to systematically analyze and quantify socioeconomic situations.