Catalan is a Romance language that evolved in the north-east of the Iberian Peninsula after the fall of the Roman Empire (Lleal, 1990; Vila i Moreno, 2008). The language was linked to the political institutions that emerged in that region within the Crown of Aragón. This polity included some of the southern departments of modern France and the prominent county of Barcelona, often the center of government of all these territories. The Crown of Aragón expanded southwards along the Mediterranean coastline (nowadays Valencia) and beyond the Iberian Peninsula, bringing the Catalan language to the Balearic and other Mediterranean islands. Variants of this tongue are still spoken in Sardinia and southern France.

Castilian (nowadays Spanish) is a Romance language that evolved in the north-central part of the Iberian Peninsula at the same time as Catalan (Lleal, 1990). Castilian was linked to the political institutions of the region, notably the Crown of Castile, which extended southwards, like Aragón, as the struggle between the Christian and Islamic forces unfolded in the Iberian Peninsula during the Middle Ages. Similar historical stages can be seen in the spread of Galician-Portuguese along the Atlantic coast, while Castilian advanced across the inland territories. When the colonial era began, both Castilian and Portuguese spread prominently through the American continent (Penny, 2002; Pharies, 2008).

Back in the Iberian Peninsula, the Portuguese and Castilian kingdoms remained different political entities in the long term. The Crowns of Aragón and Castile became unified, forming the embryo of modern Spain (which also involved other regional kingdoms). These territories were not linguistically homogeneous, with some languages (e.g., Aragonese in the Crown of Aragón) reaching official status in some centers of government during the Middle Ages. Explicit political measures were attempted to mitigate the social split between the populations originating from the old, separate kingdoms, as well as the differences existing within each territory. These efforts had different degrees of success. For example, while Aragonese receded notably, the Catalan-Castilian linguistic division lasted through the creation of the Spanish nation-state. The existence of a vernacular language (Catalan) different from the official language of the Spanish kingdom (Castilian or Spanish), together with other cultural singularities, provided support for the emergence of regionalist and nationalist movements during the XIX and XX centuries in Catalonia and other Catalan-speaking regions (Llobera, 2003). After the Spanish Civil War (1936–1939), Catalan was repressed and banned from official communications (Guibernau, 2004). The restoration of democracy in Spain (symbolized by its modern constitution from 1978) awarded Catalan a co-official status in Catalonia. Different policies have since been articulated to protect Catalan from decline; these range from subsidized presence in the mass media to different strategies within the education system (Pradilla, 2001).

The past and future evolution of the Spanish and Catalan languages is tightly linked to the complex political scenario of modern Spain. Secessionist Catalan movements have grown steadily during the past decades (Centro d’Estudis d’Opinió, 2016), currently reaching a climax with an explicit agenda towards full independence formulated by the Catalan government. While language use does not determine a person’s position around the issue of independence, use of Catalan or Spanish does correlate with the adopted political position (Clua i Fainé, 2014; Woolard, 2008). Understanding and forecasting the evolution of the Catalan-Spanish tongues in Catalonia from a population dynamics perspective is, hence, a relevant political goal.

These dynamics become even more important from a sociolinguistic point of view. We are concerned with the survival and peaceful coexistence of linguistic communities, and this is the mindset that we adopt in this paper. Our approach is relevant beyond the specific example that we deal with here. The total number of languages across the globe is reported to be declining, and the number of endangered tongues is increasing (Sutherland, 2003). While this is the overall context, here we focus on the particular dynamics of Catalan-Spanish coexistence. Abundant, curated data are available for this system. Also, as argued above, the ongoing political scenario makes it a very appealing case study.

The mathematical characterization of population dynamics is well rooted within the field of ecology (Kot, 2001; Turchin, 2003). Recent works have extended this kind of analysis to different social aspects, including the evolution of speakers of different coexisting languages (Abrams and Strogatz, 2003; Baggs and Freedman, 1990, 1993; Castellano et al., 2009; Castelló et al., 2013; Kandler and Steele, 2008; Mira and Paredes, 2005; Patriarca and Heinsalu, 2009; Zhang and Gong, 2013). The approach is based on sets of differential equations able to reconstruct historical series of data and, hopefully, make informative predictions.

The reconstruction part of this problem has been successful in various scenarios. A relevant example is the modeling of up to \(42\) cases of language coexistence by Abrams and Strogatz (2003). One of the cases studied was that of Scottish Gaelic. This was further investigated in its full complexity (which implies language dynamics across different territories) by Kandler et al. (2010). Besides accounting for historical data, these authors make a first effort at prediction in an uncertain political environment. This is a scenario similar to ours. Forecasting in the face of unsettled (political) struggles is an ambitious goal that also calls for a warning: there is a limit to the contingencies that our mathematical models can account for, and hence no prediction can be taken as definitive. This is also an opportunity for a valuable interplay between theory, its predictions, and its prospective failures, which can help improve our models.

We based our work on a vast amount of official data on language use collected by the Institut d’Estadística de Catalunya (Catalan Institute of Statistics) during the last decades (Generalitat de Catalunya, 2004, 2011, 2013, 2015a, 2015b; Torres et al., 2005). These are very rich data sets with more details available than our models can account for. Hence, important simplifications and preprocessing of the data (described below) were necessary. Ultimately, we modeled the data according to the proposal by Mira and Paredes (2005). This relies on a system of differential equations that track over time two monolingual populations alongside a bilingual one. The stability of this model has been characterized in recent years (Colucci et al., 2016; Mira et al., 2011; Otero-Espinar et al., 2013; Seoane and Mira, 2017), which makes it a powerful tool for our analyses. However, we invite future contributions to examine the same or similar data using alternative equations as well.

Materials and methods

Data sources and pre-processing

The original data were gathered by the Institut d’Estadística de Catalunya (Idescat, Catalan Institute of Statistics) in the EULP (Enquesta d’Usos Lingüístics de la Població, Survey of Language Use by the Population) surveys (Generalitat de Catalunya, 2004, 2011, 2013, 2015a, 2015b; Torres et al., 2005), which have been conducted every five years since \(2003\). The relevant item for us is the self-assessment of language use, in which the respondents reported as a percentage their daily use of each language as detailed below. This item is absent in the \(2003\) studies (Generalitat de Catalunya, 2004; Torres et al., 2005), so we only used the \(2008\) and \(2013\) editions (Generalitat de Catalunya, 2011, 2013, 2015a, 2015b).

Each speaker self-assessed, as a percentage, her daily use of each of the tongues that she speaks. These included Catalan and Spanish together with Galician, Arabic, Urdu, and many others arising from different migratory backgrounds. Because we want to focus on the Catalan-Spanish dynamics alone, we removed all speakers who used any other language more than \(30 \%\) of the time. For the speakers retained, we discarded the other languages reported and normalized the data such that Catalan plus Spanish add up to \(100 \%\). The respondents were stratified in \(5\)-year intervals to generate a table (Fig. 1a) in which each column is associated with the average year in which respondents were born (spanning from \(1910\) to \(1990\) in EULP-2008 and from \(1915\) to \(1995\) in EULP-2013) and each row with the percentage of Catalan use that they reported. Hence, each entry of the table contains the estimated number of people in Catalonia of the same age who would report the same percentage of Catalan use. We treat these age-stratified data as a proxy for the proportion of Catalan speakers at the time that each group was born. This is inspired by other apparent-time studies (Chambers, 2013; Eckert, 1997; Labov, 1963; Magué, 2006). This approximation is not free of criticism and has more often been used to study variants of the same language, but it is a suitable way to generate a time series from the available data.
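The filtering and renormalization step described above can be sketched as follows. This is a minimal illustration, not the processing pipeline applied to the EULP microdata: the function name `preprocess_respondent` and the dictionary layout (language name mapped to percentage of daily use) are hypothetical, and the sketch assumes that the \(30 \%\) cut applies to each other language separately.

```python
def preprocess_respondent(usage, max_other=30.0):
    """Return the normalized Catalan percentage, or None if the respondent is dropped.

    usage: dict mapping language name -> percentage of daily use
           (hypothetical record layout, for illustration only).
    """
    # Assumption: a respondent is dropped when any single other language
    # exceeds `max_other` percent of daily use.
    for language, pct in usage.items():
        if language not in ("Catalan", "Spanish") and pct > max_other:
            return None

    catalan = usage.get("Catalan", 0.0)
    spanish = usage.get("Spanish", 0.0)
    total = catalan + spanish
    if total == 0:
        return None  # nothing to normalize

    # Discard the other languages and rescale so that Catalan + Spanish = 100.
    return 100.0 * catalan / total
```

For instance, a respondent reporting 40 % Catalan, 40 % Spanish, and 20 % Urdu is retained and assigned a normalized Catalan use of 50 %.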

Fig. 1

From data to theory. a A sample of Catalan citizens classified their daily usage of Catalan within \(0-100 \%\) (correspondingly, of Spanish within \(100-0 \%\)). By setting a bilingualism threshold \(r\), we accept as bilinguals those individuals who use both languages a percentage of the time greater than \(r\) (in this figure we chose \(r=38\) arbitrarily). They occupy the central rows in the table in panel a. Speakers who reported using one of the languages a percentage of the time less than \(r\) were classified as monolinguals (top rows of the table for Spanish monolinguals, with Catalan usage below \(38 \%\); and bottom rows for Catalan monolinguals). b Speakers have also been aggregated in five-year intervals according to their year of birth. With the fractions \(x\) (blue crosses, Spanish monolinguals), \(b\) (green circles, bilinguals) and \(y\) (red plus symbols, Catalan monolinguals) of speakers in each linguistic group and their average years of birth we reconstruct time series that we will fit to our model equations. Two different surveys are available (each of these series is represented by smaller symbols in panel b). We can average these data (larger symbols) to create more robust estimators of the fractions of speakers. For each \(r\) a different data series is created. Instead of considering a rigid definition of bilingualism (whose existence we do not debate), we conducted our analysis with all possible integer prescriptions (\(r\in [1,50]\)). c Example fit to the data for \(r=10\). It predicts a majority of bilingual speakers in the long term with small, similarly sized monolingual groups. This is consistent with the broad definition of bilingualism that \(r=10\) implies.

In Catalonia, more than half of the population is concentrated in Barcelona and its metropolitan area, which may arguably have different dynamics from the rest of the region. Accordingly, after studying the population dynamics inferred from the global time series, we repeated the analyses for these two separate regions (Barcelona plus metropolitan area vs. rest of the territory).

In every case, a table similar to that in Fig. 1a constitutes our raw data. Several models discussed in the literature (Baggs and Freedman, 1990, 1993; Heinsalu et al., 2014; Minett and Wang, 2008; Mira and Paredes, 2005; Zhang and Gong, 2013) coarse-grain language use into three broad categories: two monolingual groups and a bilingual one. The model chosen for our analysis (Mira and Paredes, 2005) (see below) does so. We imposed these divisions on our raw data by defining a bilingualism threshold \(r\) beyond which a speaker would be considered bilingual. For example, setting \(r=20\), every speaker who uses Catalan more than \(80 \%\) of the time during a day is considered a Catalan monolingual, every speaker who uses Catalan less than \(20 \%\) of the time is considered a Spanish monolingual, and every speaker employing both Catalan and Spanish within \(20-80 \%\) is considered bilingual. There is no clear criterion for which threshold to use, so we conducted our analyses for all possible integer values of \(r\in [1,50]\). This covers all cases from the extreme in which anyone employing both languages is considered bilingual, to the stringent situation in which only those using both tongues half of the time count as non-monolinguals.
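The threshold rule can be made concrete with a short sketch. The names `classify` and `group_fractions` are hypothetical, and the sketch counts respondents equally, whereas the survey tables contain estimated population counts; it is meant only to illustrate how a single value of \(r\) turns percentages of Catalan use into the three groups.

```python
def classify(catalan_pct, r):
    """Assign a speaker to 'x' (Spanish monolingual), 'b' (bilingual),
    or 'y' (Catalan monolingual) for a bilingualism threshold r."""
    if catalan_pct < r:
        return "x"  # Catalan used less than r percent of the time
    if catalan_pct > 100 - r:
        return "y"  # Catalan used more than (100 - r) percent of the time
    return "b"      # both languages used within [r, 100 - r]


def group_fractions(catalan_pcts, r):
    """Fractions of speakers in each group; they add up to 1."""
    counts = {"x": 0, "b": 0, "y": 0}
    for pct in catalan_pcts:
        counts[classify(pct, r)] += 1
    n = len(catalan_pcts)
    return {group: count / n for group, count in counts.items()}
```

With \(r=50\) only a speaker reporting exactly \(50 \%\) Catalan use falls in the bilingual group, which matches the most stringent case described above.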

For each value of the bilingualism threshold we derive a full data series with a proportion \(x\) of Spanish monolinguals, a proportion \(b\) of bilinguals, and a proportion \(y\) of Catalan monolinguals over time (Fig. 1b). We built these data series for each EULP survey and for combinations of them. Most analyses were conducted on all available data; here we present results for the more robust (less noisy) time series. (See Appendix S2 for details and to explore all existing results, which are consistent throughout.)


To characterize our data we use the model proposed by Mira and Paredes (2005) and further studied by Mira et al. (2011), Otero-Espinar et al. (2013), and Seoane and Mira (2017), which considers the existence of \(X\) (in this case Spanish) and \(Y\) (Catalan) monolingual groups and a bilingual group \(B\). These groups present fractions \(x\), \(y\), and \(b\) of speakers respectively within a normalized population (\(x+b+y=1\)). The model assumes that the probability that monolingual speakers acquire the opposite language is proportional to the prestige (\({s}_{X}\) or \({s}_{Y}\)) of the other language and to the population speaking that other tongue. We take \({s}_{X},{s}_{Y}\in [0,1]\) and \({s}_{X}+{s}_{Y}=1\), so we can focus on \(s\equiv {s}_{X}\). Of all speakers acquiring a new tongue, a fraction \(k\) retains the old one (hence becoming bilingual) while a fraction \(1-k\) switches and forgets it. The parameter \(k\) is termed interlinguistic similarity (Mira and Paredes, 2005; Mira et al., 2011) and measures how close the two languages are as perceived by the population. The probabilities of leaving or entering each group (\(X\), \(Y\), or \(B\)) result in a set of differential equations that describe the time evolution of the linguistic population:

$$\begin{aligned}\frac{\mathrm{d}x}{\mathrm{d}t} &= c\left[(b+y)(1-k)s{(1-y)}^{a}-x\left((1-k)(1-s){(1-x)}^{a}+k(1-s){(1-x)}^{a}\right)\right],\\ \frac{\mathrm{d}y}{\mathrm{d}t} &= c\left[(b+x)(1-k)(1-s){(1-x)}^{a}-y\left((1-k)s{(1-y)}^{a}+ks{(1-y)}^{a}\right)\right].\end{aligned}\tag{1}$$

Only two equations are needed thanks to the normalized population. Also, this is a compact version of the system. For a detailed discussion of what the different parameters mean in practical terms, see Supporting Information (5) or extensive discussions in the literature by different authors (Mira and Paredes, 2005; Mira et al., 2011; Otero-Espinar et al., 2013; Seoane and Mira, 2017). The parameter \(a\) (which has been referred to as volatility (Castelló et al., 2013)) affects those speakers that promote language shift (termed attracting population (Heinsalu et al., 2014; Seoane and Mira, 2017)). It conveys an idea of how persistent the linguistic groups are: the lower \(a\), the easier it is for all groups to lose speakers, thus rendering the system more volatile (Colucci et al., 2016).
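Eqs. (1) can be condensed further. The normalization gives \(b+y=1-x\) and \(b+x=1-y\), and the two loss terms in each equation share a common factor weighted by \((1-k)+k=1\). The following rearrangement is plain algebra on the system as written and adds no new assumption:

$$\begin{aligned}\frac{\mathrm{d}x}{\mathrm{d}t} &= c\left[(1-x)(1-k)s{(1-y)}^{a}-x(1-s){(1-x)}^{a}\right],\\ \frac{\mathrm{d}y}{\mathrm{d}t} &= c\left[(1-y)(1-k)(1-s){(1-x)}^{a}-ys{(1-y)}^{a}\right].\end{aligned}$$

The bilingual fraction is recovered as \(b=1-x-y\). In this form \(k\) appears only in the gain terms of the monolingual groups: the total outflow from each monolingual group does not depend on \(k\), which only decides how that outflow is split between the other monolingual group and the bilinguals.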

The stability of this model has been thoroughly characterized as a function of its parameters (Colucci et al., 2016; Mira et al., 2011; Otero-Espinar et al., 2013; Seoane and Mira, 2017). If \(a\,>\,1\), stable solutions include scenarios in which (i) the bilinguals and either one of the monolingual groups go extinct and (ii) both monolingual groups survive alongside a bilingual group. Coexistence is usually reached for larger \(k\) and relatively balanced prestiges \({s}_{X} \sim {s}_{Y}\).
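To explore these regimes numerically, Eqs. (1) can be integrated forward in time. The sketch below is only illustrative: the function names `rhs` and `integrate`, the fixed-step fourth-order Runge-Kutta scheme, and the step size are assumptions made here for illustration, not the procedure used for the fits reported in this paper.

```python
def rhs(x, y, a, c, k, s):
    """Right-hand side of Eqs. (1); returns (dx/dt, dy/dt)."""
    b = 1.0 - x - y  # bilingual fraction from the normalization x + b + y = 1
    gain_x = (b + y) * (1 - k) * s * (1 - y) ** a
    loss_x = x * (1 - s) * (1 - x) ** a  # (1 - k) + k = 1 has been summed
    gain_y = (b + x) * (1 - k) * (1 - s) * (1 - x) ** a
    loss_y = y * s * (1 - y) ** a
    return c * (gain_x - loss_x), c * (gain_y - loss_y)


def integrate(x0, y0, params, t_end, dt=0.01):
    """Fixed-step RK4 integration; params = (a, c, k, s).

    Returns the fractions (x, y, b) at time t_end.
    """
    x, y = x0, y0
    for _ in range(int(round(t_end / dt))):
        k1x, k1y = rhs(x, y, *params)
        k2x, k2y = rhs(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, *params)
        k3x, k3y = rhs(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, *params)
        k4x, k4y = rhs(x + dt * k3x, y + dt * k3y, *params)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        y += dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0
    return x, y, 1.0 - x - y
```

For example, `integrate(0.3, 0.3, (1.31, 1.0, 0.8, 0.5), t_end=10.0)` returns the three fractions after ten time units. With equal prestiges and equal initial monolingual fractions, the two monolingual groups remain equal by symmetry, which is a convenient sanity check.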

Equations (1) are a generalization of the seminal Abrams-Strogatz model (Abrams and Strogatz, 2003) that promoted non-linear differential equations for the study of language population dynamics (even if earlier, similar approaches existed (Baggs and Freedman, 1990, 1993)). The original equations did not include bilingualism on the grounds that it played a minor role for the languages under research. This is not the case in the Catalan-Spanish coexistence scenario.

Other valuable models consider bilingual situations (Baggs and Freedman, 1990, 1993; Heinsalu et al., 2014; Minett and Wang, 2008; Zhang and Gong, 2013). Besides our familiarity with the chosen system of equations, the stability of the alternatives has not always been studied. Some of these models do not contemplate stable, coexisting languages (Minett and Wang, 2008) or do so only after alternative parameterizations are included (Baggs and Freedman, 1993). It is intensely debated whether languages can coexist stably in the asymptotic limit, but it does not seem appropriate to rule out that possibility beforehand. Hence, we decided to conduct our analysis with equations that allow this scenario explicitly.

This model consists of two coupled, non-linear differential equations with \(4\) parameters \(\{a,c,k,s\}\) and two initial conditions (\(x(t={t}^{0})\), \(y(t={t}^{0})\)). To extract these parameters from the data we followed the fitting procedure described in Appendix S1, which essentially performs a fast, heuristic least-squares minimization. Appendix S1 also compares the best and worst fits and provides plots of the fits from all data series and for all bilingualism thresholds. One example of a good fit is that obtained for \(r=10\), shown in Fig. 1c along with its predictions towards the end of the XXI century.
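The actual fitting procedure is the one described in Appendix S1. As a rough illustration of what a heuristic least-squares fit of Eqs. (1) can look like, the sketch below draws random parameter candidates and keeps the one with the smallest sum of squared residuals. Everything in it is an assumption made for illustration: the function names `simulate`, `sse`, and `random_search`, the forward-Euler integrator, the sampling ranges for \(\{a,c,k,s\}\), and the convention that times are measured in years from the first data point.

```python
import random


def simulate(params, x0, y0, times, dt=0.05):
    """Forward-Euler integration of Eqs. (1); returns [(x, y), ...] at `times`."""
    a, c, k, s = params
    x, y, t = x0, y0, times[0]
    out = []
    for target in times:
        while t < target - 1e-12:
            h = min(dt, target - t)
            # Condensed form of Eqs. (1), using b + y = 1 - x and b + x = 1 - y.
            dx = c * ((1 - x) * (1 - k) * s * (1 - y) ** a - x * (1 - s) * (1 - x) ** a)
            dy = c * ((1 - y) * (1 - k) * (1 - s) * (1 - x) ** a - y * s * (1 - y) ** a)
            x, y, t = x + h * dx, y + h * dy, t + h
        out.append((x, y))
    return out


def sse(params, x0, y0, times, data):
    """Sum of squared residuals between model and observed (x, y) pairs."""
    model = simulate(params, x0, y0, times)
    return sum((mx - ox) ** 2 + (my - oy) ** 2
               for (mx, my), (ox, oy) in zip(model, data))


def random_search(x0, y0, times, data, trials=2000, seed=0):
    """Keep the best of `trials` random candidates (assumed sampling ranges)."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(trials):
        candidate = (rng.uniform(1.0, 2.0),   # a
                     rng.uniform(0.0, 0.2),   # c, per year
                     rng.uniform(0.0, 1.0),   # k
                     rng.uniform(0.0, 1.0))   # s
        err = sse(candidate, x0, y0, times, data)
        if err < best_err:
            best, best_err = candidate, err
    return best, best_err
```

Running such a search from several seeds yields several parameter sets compatible with the same series, which is one simple way to gauge how tightly the data constrain each parameter.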


The most important result that we extract is that the Catalan-Spanish sociolinguistic system tends, under most circumstances analyzed, to a stable state in which both languages coexist. The data also reveal a few counterintuitive insights that we examine in the next subsections. The discussion mainly concerns the parameters \(k\) and \(s\) extracted from fitting the model equations to the different datasets. We can always trace the stability of the system back to these two parameters and the initial conditions. The other parameters in Eqs. (1) (\(a\) and \(c\)) are less decisive for the stability. Their trends as a function of the bilingualism threshold are discussed in Appendix S3.

Stability of the Catalan-Spanish system

The most relevant parameters of the model are the interlinguistic similarity (\(k\)) and prestige (\(s\)) which have intuitive interpretations owing to their roles in Eqs. (1). Thanks to previous studies of the model (Colucci et al., 2016; Mira et al., 2011; Otero-Espinar et al., 2013) we know how to link these parameters to the stability of the system. Figure 2a–d show how the \(k-s\) plane is divided into two regions: a gray area where coexistence is possible (depending on the initial conditions) and a white area where coexistence is never possible. For these plots, \(a=1.31\) (a value inherited from the original Abrams-Strogatz studies (Abrams and Strogatz, 2003)). For other values of \(a \ge 1\) a similar division of the plane happens (Otero-Espinar et al., 2013), and values of \((k,s)\) exist for which the dynamics are equally well explained. We could have chosen any arbitrary value \(a\ge 1\) without losing explanatory power. Comparisons between \((k,s)\) values only make sense if \(a\) is fixed. Hence, to better illustrate the results, we performed our analyses both allowing \(a\) to vary and keeping it fixed at \(a=1.31\). Similar conclusions are reached in both cases (see Appendix S3), but we focus on the fixed case now.

Fig. 2

Stability analysis. Mapping measured parameters into the \(k-s\) plane for a the whole Catalan territory, b Barcelona and metropolitan area, and c not Barcelona. Falling outside the shaded region implies that one of the languages will go extinct (shaded region is consistent with, but does not imply, sustained coexistence). Each point represents average \((k,s)\) from the fits with a given \(r\). Error bars represent one standard deviation. a For every \(r\ge 23\) the parameters imply the extinction of one of the languages. The slightly larger prestige for Spanish (\({s}_{S}\simeq 0.57\) versus \({s}_{C}\simeq 0.43\) for Catalan; both are averages across all values of \(r\)) suggests a larger survival outlook for this tongue. b Within Barcelona both prestiges are fairly balanced (\({s}_{S}\simeq 0.51\) versus \({s}_{C}\simeq 0.49\)). For \(r\ge 31\) one of the languages is predicted to go extinct, but the data and the model are often compatible with the extinction of either language. c Spanish presents a larger prestige in the rest of the Catalan territory (\({s}_{S}\simeq 0.6\) versus \({s}_{C}\simeq 0.4\)). d For each \(r\), arrows start at the average \((k,s)\) for Barcelona and end at the corresponding point obtained outside Barcelona. They indicate how the rural Catalan areas fall into an extinction course more easily (for lower values of \(r\)). e The origin of all arrows has been shifted to \((0,0)\). We observe that \(k\) is always notably smaller for speakers outside Barcelona. This indicates that, outside Barcelona, Catalan and Spanish are perceived as more different from each other.

For each data set and each bilingualism threshold we derived several collections of parameters compatible with the corresponding time series. The existence of several good fits in each case allows us to perform a statistical analysis that bootstraps the variability of the parameters. Figure 2a shows averages and standard deviations of \(k\) and \(s\) for the different integer values of \(r\in [1,50]\). More generous definitions of bilingualism are explained by models with larger interlinguistic similarity, i.e., smaller values of \(r\) lie on the right side of the \(k-s\) plane and larger values of \(r\) lie towards the middle-left. This range of \(k\) values is explained by the differences in the bilingualism threshold \(r\). Unfortunately, the model and data available cannot offer a tight constraint on the interlinguistic similarity. We note, though, that on average it does not become arbitrarily low, not even for the most restrictive definitions of bilingualism in which only speakers using each language \(50 \%\) of the time are considered bilinguals.

Our results also indicate that the prestige of Spanish (\(s \sim 0.57\)) is consistently slightly larger than that of Catalan (\(1-s \sim 0.43\)) for any value of \(r\). These numbers are sustained throughout the whole range \(r\in [1,50]\), strongly suggesting that this is a good descriptor of the Catalan-Spanish system given the data and the model.

Roughly half of the \(r\) values allow for asymptotic stability (\(r\le 22\)) and the other half (\(r\ge 23\)) strictly rule it out. However, for larger values of \(r\) the prestige of both languages becomes more balanced (\(s\) is only slightly above \(0.5\) for larger \(r\)) and the fits become less consistent: some of them predict the extinction of one tongue and others predict the extinction of the other, both usually on asymptotic time scales much longer than \(100\) years. This indeterminacy arises because the system sits near a bifurcation point and the data are not sufficient to clarify the outcome of the competition dynamics.

Our stability analysis is complemented with an attempt at prediction towards \(2030\). Such predictions must be taken with the utmost caution: they are the results obtained with this model for the data available, and the seemingly open-ended nature of human dynamics does not allow us to have a fully comprehensive understanding of the situation. However, some qualitative results outlined below are fairly consistent across datasets, fitting setups, and definitions of bilingualism (through the parameter \(r\)). This invites us to be confident about the general conclusions.

We recorded the percentages of Spanish (Fig. 3a), bilingual (Fig. 3b), and Catalan (Fig. 3c) speakers predicted by the model for the year \(2030\) for each of the combinations of parameters derived for each time series. These year-\(2030\) predictions were binned in intervals of width \(0.05\) in the fraction of speakers. Figure 3a–c shows that our results consistently indicate a medium-term coexistence between Spanish and Catalan, even for those configurations of parameters that imply the eventual extinction of one of the tongues. These extinction scenarios happen with stringent definitions of bilingualism (large \(r\)) and present relatively balanced prestiges (\({s}_{S} \sim 0.5 \sim {s}_{C}\)), indicating that even if one language must die eventually, that result will only happen asymptotically and coexistence could perhaps be sustained for several generations.
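The binning of the year-\(2030\) projections amounts to a weighted histogram over the fraction of speakers. The sketch below illustrates the idea; the function name `weighted_histogram` is hypothetical, and the weight \(1/(1+\text{residual})\) is a placeholder assumption chosen only so that poorer fits count less. The weighting actually used is the one given in the Supporting Information.

```python
def weighted_histogram(fractions, residuals, bin_width=0.05):
    """Normalized weighted histogram of projected speaker fractions in [0, 1].

    fractions: projected fraction of speakers in one group, one per fit.
    residuals: non-negative fit residual associated with each projection.
    """
    n_bins = int(round(1.0 / bin_width))
    hist = [0.0] * n_bins
    for fraction, residual in zip(fractions, residuals):
        weight = 1.0 / (1.0 + residual)  # placeholder weighting (assumption)
        # A fraction of exactly 1.0 is assigned to the last bin.
        index = min(int(fraction / bin_width), n_bins - 1)
        hist[index] += weight
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist
```

Each entry of the returned list plays the role of one gray-scale cell for a given \(r\): the share of the total weight carried by the fits whose projection falls in that \(0.05\)-wide interval.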

Fig. 3

Model projections for \(2030\). For each set of parameters obtained after fitting the data, Eqs. (1) were evolved into the future until \(2030\). The fraction of speakers within each group (a, Spanish monolinguals; b, bilinguals; and c, Catalan monolinguals) was then binned in \(0.05\) intervals. The contributions to these intervals were weighted as indicated in the Supporting Information (1), so that fits with larger residuals contribute less than the more accurate ones. The gray scale indicates the likelihood that a set of parameters would evolve into each \(0.05\) bin for a given \(r\) in \(2030\). For reference, blue, green, and red lines indicate respectively the fraction of Spanish, bilingual, and Catalan speakers in \(1990\), the last data point in the EULP series. d Average (with standard deviation) change in the fraction of speakers for each group for each \(r\) value.

Regardless of the bilingualism threshold, the model always predicts a less important role for the Catalan language in the medium term. Catalan speakers towards \(2030\) would always be fewer than Spanish speakers and, more often than not, fewer than bilingual speakers. Meanwhile, Spanish stands as the dominant language for some definitions of bilingualism, and it is consistently the group projected to grow the most until 2030. Figure 3d shows the expected gain of speakers for each group in \(2030\), with Spanish standing out. For \(r\,<\,30\), Spanish is expected to win most of its new speakers from bilinguals. For \(r\,>\,30\) both bilinguals and Catalan monolinguals would lose a substantial number of speakers to Spanish.

Analysis across different geographical areas

The linguistic map of Catalonia contains two clearly distinguishable regions. On the one hand, Barcelona and its metropolitan area constitute the second largest urban hub in Spain and concentrate more than \(70 \%\) of the Catalan population. This is home to large migrant groups from the rest of Spain and elsewhere (notably Pakistan and China), while Barcelona itself is a very cosmopolitan city attracting large masses of tourists. On the other hand, the rest of Catalonia (while still containing notable urban areas and some regions with large migrant populations) has a more rural character and is spread across larger territories.

To further understand the linguistic reality of the system we split the data into those two broad geographical regions and repeated our analyses. We found similar tendencies in the parameters \(a\) and \(c\). The interlinguistic similarity again drops as the definition of bilingualism becomes more stringent (Fig. 2b, c), again showing the limitations of the model and existing data to constrain the value of \(k\). However, note once again that the interlinguistic similarity never drops to zero (not even for the most stringent definitions of bilingualism); and note also how it peaks at roughly \(k=0.9\) in non-urban areas (Fig. 2c), while it comes much closer to \(k=1\) in the Barcelona area, thus stressing the difference that the population outside Barcelona always perceives between both languages. On the other hand, average values of prestige (\(s\)) are broadly consistent throughout \(r\in [1,50]\), suggesting that the model and data together are able to constrain this characteristic of the Spanish-Catalan dynamics.

The analysis of the parameter \(s\) offers a counterintuitive result: in general, Spanish presents a lower prestige in Barcelona and its metropolitan area than in the rest of Catalonia. Note that Catalan is less spoken in Barcelona: over the last \(100\) years it never had more speakers than Spanish (Supporting Fig. 4), while the rest of Catalonia does present a larger body of monolingual Catalan speakers (Supporting Fig. 5). This would naively suggest that the perceived prestige of Catalan is lower in the urban metropolis (hence Spanish prestige would be higher), but our analysis indicates exactly the opposite: rural areas (where Spanish is less spoken) perceive Spanish as a more prestigious tongue. The catch is that, while the decay of Spanish and Catalan in the Barcelona area has been roughly symmetric over the last century and favors a strong bilingual group (Supporting Fig. 4a), in regions outside Barcelona Spanish speakers have remained relatively constant across time while the larger Catalan group has been decaying in favor of bilinguals (Supporting Fig. 5a). In Eqs. (1), a large prestige captures precisely the ability of a smaller group to make a larger (and originally stronger) one decline.

Figure 2d–e summarizes the differences in how the two geographical areas perceive Catalan and Spanish. The arrows in Fig. 2d connect the average \(k\) and \(s\) in Barcelona with the averages outside Barcelona. These arrows are replotted with their origin at \((0,0)\) (Fig. 2e) to indicate how speakers outside Barcelona not only assign a slightly lower status to Catalan, but also perceive both languages as more different, as indicated by the \(\sim 0.1\) drop in interlinguistic similarity consistent across similar definitions of bilingualism.

As we did for the aggregated data, we complemented the stability analysis by projecting the evolution of the model into the future until \(2030\). Again, both for the metropolitan and non-metropolitan areas, most configurations of the model are compatible with medium-term coexistence of the tongues. When the definition of bilingualism is more stringent (larger \(r\)) the asymmetries (either in prestige or initial conditions) become relevant and, in some cases, can produce substantial gains and losses in the number of speakers within the projected time. Within Barcelona, Catalan would be the endangered language (Supporting Fig. 6). For almost every definition of bilingualism, Spanish would draw most of its new speakers directly from Catalan monolinguals (Supporting Fig. 6d).

Counterintuitively again, outside the metropolis, for large \(r\) Catalan would be able to gain a large number of speakers from Spanish (Supporting Fig. 7). Despite the larger prestige of Spanish in those regions, this would be possible due to the still large Catalan monolingual population outside the Barcelona area. For very low \(r\) in regions outside Barcelona, the large monolingual support of Catalan would fade as the projections predict a larger presence of Spanish monolingual speakers. For intermediate \(r\), Catalan would grow notably, but by drawing speakers from the bilingual group rather than from the Spanish monolinguals (Supporting Fig. 7d).


In this paper we analyzed the system of Spanish and Catalan coexistence using recent and thorough data surveys of language use and up-to-date models based on non-linear equations. There is a gap between the theoretical developments and the empirical data available (Seoane and Mira, 2017). The former often rely on concepts (e.g., bilingual speakers) that have a clear definition within the model but that are difficult to pin down empirically. Furthermore, definitions of bilingualism within such mathematical models might differ from those in other linguistic subfields. Given the complex and subjective nature of the problem under research, it is necessary to rely on the self-assessment of linguistic qualities, in this case the percentage of language use.

We wanted to conduct our analysis in the most general way possible given the data. We considered a series of bilingualism thresholds (encoded by \(r\in [1,50]\)) and did not assume that any of these thresholds constitutes the right definition of bilingualism. Instead, we performed our analysis for all possible scenarios. This could potentially produce a wealth of models with antagonistic predictions, hence frustrating any robust conclusion. This happens often in complex systems that sit midway between competing forces, which is the case here. Instead, our analysis renders a consistent picture across different data sources and for most definitions of the bilingual group. This picture is that of a long-term coexistence between Spanish and Catalan in Catalonia, always alongside a bilingual group, and with a dominating role for Spanish while the group of monolingual Catalan speakers declines slightly.
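To make the role of the threshold concrete, the following minimal sketch shows one way a self-reported percentage of Catalan use could be mapped to the three groups for a given \(r\). The classification rule is an assumption made for illustration (it is not spelled out in this section): respondents below \(r\) percent are counted as Spanish monolinguals, those above \(100-r\) percent as Catalan monolinguals, and everyone in between as bilingual. The function names and the sample values are hypothetical.

```python
from collections import Counter

GROUPS = ("catalan", "bilingual", "spanish")


def classify(catalan_use_pct, r):
    """Assign one respondent to a group from self-reported Catalan use (0-100).

    ASSUMED rule, for illustration only:
      use < r        -> Spanish monolingual
      use > 100 - r  -> Catalan monolingual
      otherwise      -> bilingual
    A larger r narrows the bilingual band, i.e. a stricter definition.
    """
    if not 1 <= r <= 50:
        raise ValueError("r is expected in [1, 50]")
    if catalan_use_pct < r:
        return "spanish"
    if catalan_use_pct > 100 - r:
        return "catalan"
    return "bilingual"


def group_fractions(usages, r):
    """Fraction of respondents falling in each group for threshold r."""
    counts = Counter(classify(u, r) for u in usages)
    n = len(usages)
    return {g: counts[g] / n for g in GROUPS}


# Hypothetical survey answers (percentage of Catalan use), not real data.
sample = [0, 10, 30, 50, 70, 90, 100]
loose = group_fractions(sample, r=5)    # wide bilingual band
strict = group_fractions(sample, r=40)  # narrow bilingual band
```

Under this assumed rule, sweeping \(r\) from 1 to 50 turns a single survey into a family of time series, one per definition of bilingualism, each of which can then be fitted separately.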

Further details hinge on the bilingualism threshold employed. For roughly half of the definitions of bilingualism (\(r\le 22\)) it is predicted that both languages will survive. The monolingual Catalan group is expected to be smaller than the Spanish one towards 2030 (Fig. 3), regardless of \(r\). Most of the population would be bilingual for \(r\le 22\). For very stringent definitions of bilingualism (\(r\,>\,40\)) some of the models predict a huge loss of Catalan speakers before 2030, but coexistence between both languages is still the most common outcome. Similar results are obtained when segregating the data between Barcelona plus its metropolitan area versus the rest of Catalonia. Coexistence of both tongues remains the most persistent outcome, but for strict definitions of bilingualism (\(r\,>\,40\)) large losses of Catalan speakers in Barcelona and of Spanish speakers outside Barcelona become likely within a few decades.

These mid-term predictions leave considerable room for action. They are also daring and should be subjected to continuous revision as new data become available. The correctness of our analyses depends on some assumptions: (i) The data collected so far are reliable and informative about how the situation might evolve. Consequently, (ii) social and political circumstances should not vary considerably in the future. Unfortunately, there are no studies about how notable political events affect the smooth dynamics of the system. (This is also true for all other models of language shift (Abrams and Strogatz, 2003; Baggs and Freedman, 1990, 1993; Castellano et al., 2009; Castelló et al., 2013; Kandler and Steele, 2008; Mira and Paredes, 2005; Patriarca and Heinsalu, 2009; Zhang and Gong, 2013).) Should the socio-political stage change drastically (e.g., if Catalonia were to become an independent state, a possibility debated nowadays), our analysis might become outdated. (iii) We never assume that we are using the right theory. Our equations might not be correct, so this exercise should help us validate the model, even if some predictions lie far in the future.

Given the room for action that our analysis suggests, a result of our study that would be relevant for language planning and policy making (Fishman, 2013) is the connection between Catalan’s survival and the growth of bilingualism. This seems to contradict other studies from sociolinguistics, which posit that bilingualism is a first step towards language loss as intergenerational transmission of one of the languages becomes weaker. Actually, our model is compatible with the latter situation too. See, e.g., Supplemental Fig. 1b4, in which the rise of bilingualism is linked to a definitive decline of Catalan usage. In other words, the maths of population dynamics seems compatible both with bilingual groups that grow to later coexist with both monolingual groups (e.g., Supplemental Fig. 1a4) and with bilingual groups that act as a way out of one monolingual option, then progressing towards full language shift. In our work, the data constrain the model parameters, thus selecting the dynamics that best explain the observed evolution. In view of this, the rise of a bilingual group might not be an infallible indicator of future language decline. However, our results suggest another potentially predictive observable: a marked asymmetry between the time evolution of both monolingual groups. This seems consistent with the corresponding asymmetry between the intergenerational transmission of both languages discussed in the literature (Fishman, 2013), and seems to better underpin the causality behind language shift.

In this paper we also quantified the perceived prestige of both languages and their interlinguistic similarity. The latter cannot be tightly constrained by the model since we lack a definitive measure of bilingualism and, for this dataset, \(k\) changes widely with \(r\). Results for the prestige parameter were relatively consistent across \(r\in [1,50]\), suggesting that we have successfully captured \(s\) for the languages involved. Also, spatially segregated data reveal interesting differences across regions, notably the higher prestige of Spanish in the areas that, historically, had more Catalan speakers. In those regions, the perceived difference between the languages is also larger. Both these observations hold despite the variation of \(r\), strongly suggesting that they are real features of the system.

In the estimation of these parameters we assume that all speakers perceive both languages in the same way, as an average person would. This is of course unrealistic, and it hints at an important factor in the fate of systems of coexisting languages. Different speakers might perceive languages differently, especially if they come from different social or geographic backgrounds (notably including migrating populations). This affects not only the prestige: a speaker who is exposed to a second language later in life might also perceive a bigger learning challenge, thus sensing a lower \(k\). The reasons for differing prestiges might be politically motivated, and thus might be of paramount importance for the system that we study here. Extending the mathematical model to include these effects is beyond the scope of this paper, but we think that it is a pressing task, especially as the world becomes more interconnected.

It could be objected that the parameters of our model, and of similar ones, are abstract and difficult to relate to more concrete features. Notwithstanding our abstractions, the parameters \(s\) and \(k\) have definite causal consequences in terms of population flows in our equations. Both the steady states and the dynamical unfolding of the equations are intimately linked to their numerical values. Measures under different circumstances (e.g., values of \(r\), or geographically stratified data) can be compared to each other and sound conclusions can be extracted. In this sense, \(k\), \(s\), and the other parameters carry meaningful information about the Catalan-Spanish system. We assess these quantities indirectly (by fitting the data to our model), but perceived prestige or similarity between tongues and other sociolinguistic characteristics can be directly reported in future surveys. These more concrete quantities can then be correlated to our parameters, thus helping us bridge the gap between theory and empirical data in the social sciences. For example, we can test empirically whether people from Barcelona report a higher similarity between Catalan and Spanish than people elsewhere in Catalonia, as predicted by our results.

This model has been compared to data of speakers over time for another system of coexisting languages (Galician and Spanish in northwest Spain), albeit only as an illustration and with less conclusive results (Mira and Paredes, 2005; Mira et al., 2011; Seoane and Mira, 2017). Those studies suggest that the Galician language might sit at a tighter crossroads than Catalan, but they do not seem to be consistent with each other, arguably because of defective data in (Mira and Paredes, 2005) (data not collected by the authors, nevertheless). More rigorous analyses with this same model (similar, indeed, to the ones presented here) and better curated data have helped detect a potential difference in how speakers switch language options in rural versus urban setups (Juane et al., 2019). While the Barcelona area investigated in this paper is prominently urban, the rest of the Catalan territory does include a varied mixture of large cities and very rural areas. Hence, to locate similar effects in the current system we would need more geographically refined data. Other models have been used to study other systems of coexisting languages. The original work that largely prompted the population dynamics approach to language shift studied \(42\) cases in which arguably different languages coexisted and, consistently, one of them declined (Abrams and Strogatz, 2003). Since these languages are so different from each other, it could be argued that we would measure a low \(k\) if they were studied with the model in this paper. This model, indeed, recovers the one from (Abrams and Strogatz, 2003) as \(k\to 0\) and bilinguals decline. Impressive work has been done in the case of Scottish Gaelic-English coexistence in Scotland (Kandler et al., 2010), where trends were analyzed in time and also in small geographic units using reaction-diffusion equations. These results suggest that Scottish Gaelic is progressing towards extinction. However, this work is influenced by the model of (Minett and Wang, 2008) in a critical choice of model parameters that renders the equations incapable of detecting stable coexisting solutions if they existed. This does not mean, however, that the conclusions are not sound and relevant. All these and other systems of coexisting languages constitute enticing test cases on which to compare models against each other so that we can select the best working ones and reject others. We look forward to such work.
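The limit \(k\to 0\) mentioned above can be checked numerically. The sketch below assumes that the model equations take the form usually attributed to Mira and Paredes (2005), with monolingual fractions \(x\) and \(y\), bilingual fraction \(b=1-x-y\), prestige \(s\) of the first language, and similarity \(k\); the exact form, the rate constant `c`, and the exponent `a` (set to an illustrative value) are assumptions, since the equations are not restated in this section. The function names are hypothetical.

```python
def rates(x, y, s, k, a=1.31, c=1.0):
    """Time derivatives (dx/dt, dy/dt) of the two monolingual fractions.

    ASSUMED form of the bilingual model (bilinguals are b = 1 - x - y):
      dx/dt = c[(1-k) s (1-x)(1-y)^a - (1-s) x (1-x)^a]
      dy/dt = c[(1-k)(1-s)(1-y)(1-x)^a - s y (1-y)^a]
    The values of a and c are illustrative defaults, not fitted values.
    """
    dx = c * ((1 - k) * s * (1 - x) * (1 - y) ** a
              - (1 - s) * x * (1 - x) ** a)
    dy = c * ((1 - k) * (1 - s) * (1 - y) * (1 - x) ** a
              - s * y * (1 - y) ** a)
    return dx, dy


def abrams_strogatz_rate(x, s, a=1.31, c=1.0):
    """dx/dt in the two-group model of Abrams and Strogatz, with y = 1 - x."""
    y = 1 - x
    return c * (y * s * x ** a - x * (1 - s) * y ** a)


# With k = 0 and no bilinguals (y = 1 - x), both descriptions should agree,
# and dx/dt + dy/dt = 0, so no bilingual group is ever created.
x0, s0 = 0.3, 0.6
dx, dy = rates(x0, 1 - x0, s0, k=0.0)
gap = abs(dx - abrams_strogatz_rate(x0, s0))
```

Under these assumptions `gap` and `dx + dy` vanish up to rounding error, which is the sense in which the two-group dynamics is a special case of the three-group one.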

Mathematical models of language contact situations give us hints about the important factors that could reverse the current predictions, notably the perception of bilingualism and the geographic distribution of the population (Seoane and Mira, 2017). The former is clear from our analysis and has been discussed theoretically by Heinsalu et al. (2014). A key to stability is thus bolstering a strong bilingual group capable of capturing speakers faster than any monolingual group. To achieve this, it is important that bilingual individuals reach a preponderant role within their society. Failing to establish a lasting bilingual group guarantees that the competition will end with one language extinct; the most likely scenario, given the data, would be the decline of Catalan.