## Introduction

Human populations are highly mobile in the modern world, and migration is one of the main factors that determine changes in population size, distribution, and structure (Abel and Sander, 2014; Agliari et al. 2018). As migration impacts the demographic and socio-economic aspects of a country, it has become one of the most challenging issues confronting policymakers for nations around the world (International Organization for Migration, 2017a, c). Understanding internal migration flows, which are normally substantially larger than international ones, and how they change over time is critical for keeping subnational population numbers up to date (Frayne, 2005; Pendleton et al. 2014; Wardrop et al. 2018). Contemporary data on internal migration flows are valuable for urban planning, resource allocation, infrastructure development, public service provision, and impact assessments. For instance, identifying where people migrate internally is often vital in development work, as migrants might be marginalized and at higher risk due to a lack of resources to meet demands (Lu et al. 2012; Lu et al. 2016; Ruktanonchai et al. 2016a). However, our knowledge of contemporary internal migration patterns remains poor for many countries (Garcia et al. 2015; Sorichetta et al. 2016; International Organization for Migration, 2017b), and is difficult to update between data collections for the majority of countries around the world.

Data collected from traditional sources, such as national population and housing censuses and household surveys, are the primary source for migration statistics (International Organization for Migration, 2018). Within population and housing censuses, migration is typically measured through a change in residence over a 1-year or 5-year period prior to the census. The increasing use of global positioning systems (GPS) has supported the collection of more spatially precise data, but each census only provides a single snapshot of migration flows, commonly once every decade, and migration patterns typically change over time between censuses or surveys (Namibia Statistics Agency, 2013; Wesolowski et al. 2013). Moreover, surveys only sample a small proportion of the population, and the logistical challenge of censuses makes them an infrequent and expensive source of demographic data (Wardrop et al. 2018).

As migration is anticipated to continue to rise, both in terms of volume and reach, the need for timely updates to demographic statistics to inform migration policy development is increasing, and traditional sources are typically not well equipped to meet it (International Organization for Migration, 2018). To predict contemporary migration for many countries, interest in modeling migration flows has grown, leading to the development of advanced methodologies to estimate migration rates (Courgeau, 1995; Henry et al. 2003; Cohen et al. 2008; Abel, 2013; Abel and Sander, 2014; Garcia et al. 2015; Sorichetta et al. 2016; Vobruba et al. 2016). However, regardless of how sophisticated these methods are, their estimates remain largely constrained by the lack of contemporary input data and by the often coarse spatiotemporal resolution of the data that are available (Garcia et al. 2015; Sorichetta et al. 2016).

Call detail records (CDRs) routinely collected by mobile phone operators for billing purposes are particularly promising for analyzing migration-related phenomena and a potential solution to existing data gaps (International Organization for Migration, 2018). CDRs contain an entry for each call or text (or other billable event) made or received by any anonymous user, together with the date and time of each communication and an identifier for the tower that the communication was routed through within the operator’s network (Ruktanonchai et al. 2016b; Zu Erbach-Schoenberg et al. 2016). The tower-level location of each communication can then be identified, and spatially and temporally explicit estimates of human mobility can be derived from anonymized CDRs by following the movement of individual mobile users between communications. These data have been increasingly used for quantifying short-term human mobility, mapping dynamically changing population densities, estimating infectious disease spread risk, and measuring population displacements due to disasters and conflicts (Lu et al. 2012; Wesolowski et al. 2012; Deville et al. 2014; Tatem et al. 2014; Wesolowski et al. 2014a; Wesolowski et al. 2015a; Wesolowski et al. 2015c; Lu et al. 2016; Ruktanonchai et al. 2016b; Zu Erbach-Schoenberg et al. 2016; Wesolowski et al. 2017). Moreover, previous work on defining overall and seasonal patterns of population movement using CDRs suggested they could also be used to model internal migration (Blumenstock, 2012; Wesolowski et al. 2013; Ruktanonchai et al. 2016a; Wesolowski et al. 2017).

In previous studies, however, CDRs frequently spanned periods much shorter than one year. Although multi-year mobility analyses using CDRs have been presented, no studies have compared individual places of usual residence across different years to estimate migration flows in a way that matches the definition of migration used in censuses (Blumenstock, 2012; Zu Erbach-Schoenberg et al. 2016; Wesolowski et al. 2017). Using a multiannual CDR dataset from Namibia, we assess for the first time how CDRs, as a novel data source, might be used efficiently and accurately to replicate the internal migration statistics produced in a census, and examine how CDRs could improve the estimates made using classical gravity models. This study also reveals otherwise unmeasurable year-by-year migration patterns to assess the potential of CDRs for updating internal migration statistics.

## Datasets

### Census migration statistics

The most recent census in Namibia was conducted in 2011, and we obtained the internal migration statistics between regions from a census-based migration report published by the Namibia Statistics Agency in 2015 (Namibia Statistics Agency, 2015). To derive the 1-year internal migration statistics, the census (with a reference night of 28 August 2011) asked about each individual’s place of usual residence (where does the person usually live?) and the place of previous residence (where did the person usually live since September 2010?). The place of residence refers to the location where a person usually lives for the greater part of any year (at least six months). An individual was considered an internal migrant if the regions of usual residence and previous residence did not match in the 2011 census.

### CDR-derived flow data

To assess whether mobile network data could produce comparable migration statistics, we obtained a large dataset of 72 billion anonymized CDRs covering October 2010 to April 2014 from Mobile Telecommunications Limited (MTC) (Mobile Telecommunications, 2018). MTC is the leading network operator in Namibia, with a 76% market share and network coverage reaching 95% of the population (Mobile Telecommunications, 2018). The CDR dataset obtained from MTC included the time and routing tower for each call and text and a random, uniquely hashed number for each user. The approximate location of a user was defined by the location of the routing mobile phone tower for each communication. The data were spatially aggregated to the regional level to match the census migration data and to further reduce the sensitivities of using individual-level data. We estimated a user's place of residence for a given period as the region where the user was observed most frequently during the period of interest. As data on very infrequent mobile phone users or on seasonal movement (e.g., short-term holiday travel) might introduce noise in defining residential places, we only included users who were active for more than 30 days in each year (12 months), with years defined as below.

To match as closely as possible the time frame used in census and to be comparable between the 2011 and 2012 periods, we defined the residence of each user for each year: Year 1 (October 2010–September 2011), Year 2 (October 2011–September 2012), and Year 3 (October 2012–September 2013), respectively. We derived migration flows of mobile phone users for periods 2011 and 2012 by comparing residences between Years 1 and 2, and between Years 2 and 3, respectively. If mobile users changed residence between the two years, they were identified as migrants, otherwise as non-migrants. In addition, we also assessed the potential impact of data filtering and different time lengths on defining residences (Supplementary information [SI] text).
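The residence and migrant definitions above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `(user, year, day, region)` record layout, the function names, and the tie-breaking rule are assumptions made for illustration, and the example runs on region-level records rather than raw tower events.

```python
from collections import Counter, defaultdict


def yearly_residence(records, min_active_days=30):
    """Map (user, year) to the region where the user was seen most often.

    `records` is an iterable of (user, year, day, region) tuples; this layout
    is a hypothetical stand-in for CDRs already aggregated to regions.
    """
    regions = defaultdict(list)     # (user, year) -> region of every event
    active_days = defaultdict(set)  # (user, year) -> distinct active days
    for user, year, day, region in records:
        regions[(user, year)].append(region)
        active_days[(user, year)].add(day)
    residence = {}
    for key, seen in regions.items():
        # Keep only users active on more than `min_active_days` days that year.
        if len(active_days[key]) > min_active_days:
            # Ties go to the region encountered first (an assumption).
            residence[key] = Counter(seen).most_common(1)[0][0]
    return residence


def migration_flows(residence, year_a, year_b):
    """Count users whose residence differs between two consecutive years."""
    flows = Counter()
    for (user, year), origin in residence.items():
        if year != year_a:
            continue
        destination = residence.get((user, year_b))
        # Users missing from either year are skipped, not counted as migrants.
        if destination is not None and destination != origin:
            flows[(origin, destination)] += 1
    return flows
```

Calling `migration_flows(residence, 1, 2)` would correspond to Period 2011 and `migration_flows(residence, 2, 3)` to Period 2012, under the year numbering used in the text.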

### Model covariates

To estimate migration with models for the 2011 period, we also collated potential migration-related demographic, socioeconomic, geographic, and environmental variables, as described in previous studies (Garcia et al. 2015; Sorichetta et al. 2016), including population by region in 2010 and 2011 (Namibia Statistics Agency, 2013); the proportions of population living in urban areas, male population, population aged 15–59, educated population, labor force participation, and marital status in the population aged 15 years and above; administrative unit boundaries to define the distance and contiguity between regions and their area (Zhao et al. 2012); and the average annual precipitation by region. The collation of covariates is detailed in the SI text.

## Models and analysis

We fit three types of models to census data to explore whether CDR-derived migration data can accurately replicate traditional census-derived migration statistics (Table S1): (1) CDR-based linear models (CDRLMs), using CDR-derived migrating user data either alone or combined with the covariates used in gravity models; (2) gravity-type spatial interaction models (GTSIMs), which have been applied extensively to estimate migration flows based on a range of migration-related push-pull factors, including populations and distance between origin and destination (Zipf, 1946; Hua and Porell, 1979; Garcia et al. 2015; Wesolowski et al. 2015b; Ruktanonchai et al. 2016a; Sorichetta et al. 2016; Vobruba et al. 2016); and (3) GTSIMs extended using CDR data (hereafter called CGTSIMs).

### CDR-based linear models

Initially, we used Pearson correlation coefficients to assess the relationship between CDR and census data. To investigate how well the CDRs can replicate the census migration numbers, we built four sub-models of CDRLMs, using CDR-derived migrating user numbers as the independent variable either alone or together with other covariates:

$${\mathrm{MIG}}_{i,j} = \beta _0 + \beta _1{\mathrm{CDR}}_{i,j} + \overrightarrow \beta \left[ X \right]$$
(1)

where the dependent variable $${\mathrm{MIG}}_{i,j}$$ comprises the observed migration flows between regions in Namibia from the census. $${\mathrm{CDR}}_{i,j}$$ is the number of CDR-derived migrations from origin i to destination j, with the coefficient $$\beta _1$$ and the constant $$\beta _0$$. The suite of models was built by successively adding the same covariates that were used in GTSIMs, represented by the matrix X and its vector of coefficients $$\overrightarrow \beta$$.
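As an illustration of the simplest sub-model in Eq. (1), with no additional covariates, the following Python sketch computes a Pearson correlation and an ordinary least squares fit of census flows on CDR-derived flows. The function names are hypothetical, any input numbers are for the reader to supply, and the original analysis was carried out in R rather than with this code.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fit_cdrlm(cdr, mig):
    """Least-squares estimates (b0, b1) for MIG = b0 + b1 * CDR."""
    mx, my = mean(cdr), mean(mig)
    sxy = sum((a - mx) * (b - my) for a, b in zip(cdr, mig))
    sxx = sum((a - mx) ** 2 for a in cdr)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1
```

Each element of `cdr` and `mig` would be one origin-destination pair, so the two sequences must list the pairs in the same order.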

### Gravity-type spatial interaction models

In the simplest form of gravity models (Zipf, 1946), the flow of migration between regions is proportional to their total populations and inversely proportional to the distance between them:

$${\mathrm{MIG}}_{i,j} = \frac{{{\mathrm{POP}}_i^{\beta _1}{\mathrm{POP}}_j^{\beta _2}}}{{{\mathrm{DIST}}_{i,j}^{\beta _3}}}$$
(2)

where $${\mathrm{POP}}_i$$ and $${\mathrm{POP}}_j$$ refer to the populations at an origin i and a destination j in 2010, respectively, and $${\mathrm{DIST}}_{i,j}$$ represents the distance between i and j. The exponents $$\beta _1$$, $$\beta _2$$, and $$\beta _3$$ indicate the magnitude of the effect of each variable.
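Taking natural logarithms of both sides of Eq. (2) gives a form that is linear in the exponents, which makes their interpretation as effect sizes explicit. This is a standard algebraic rearrangement shown for illustration, not necessarily the fitting procedure used in this study:

$$\ln {\mathrm{MIG}}_{i,j} = \beta _1\ln {\mathrm{POP}}_i + \beta _2\ln {\mathrm{POP}}_j - \beta _3\ln {\mathrm{DIST}}_{i,j}$$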

As a range of potential push-pull factors, e.g., urbanization and natural disasters, could affect human migration, the models can be further extended to reach more accurate estimates, as described in previous studies (Garcia et al. 2015; Sorichetta et al. 2016). However, given that the number of regions in Namibia is small (13 regions), and to prevent overfitting, we only tested models that replaced the total population variables with the percentage of population living in urban areas (URBANi and URBANj) and the precipitation (RAINi and RAINj) in origin and destination, respectively (SI text). Although both logistic and Poisson regressions have been widely used in gravity models to predict migration flows, the outputs of logistic regression should be identical to the estimates of a Poisson regression with an offset variable of non-migrating populations (Garcia et al. 2015; Ruktanonchai et al. 2016a; Sorichetta et al. 2016). Therefore, we only fit GTSIMs using the logistic regression function here:

$$\frac{{{\mathrm{MIG}}_{i,j}}}{{{\mathrm{TOT}}_i}} = \frac{{e^{\beta _0 + \beta _1P_i + \beta _2P_j - \beta _3{\mathrm{DIST}}_{i,j}}}}{{1 + e^{\beta _0 + \beta _1P_i + \beta _2P_j - \beta _3{\mathrm{DIST}}_{i,j}}}}$$
(3)

where $${\mathrm{TOT}}_i$$ represents the total population residing in an origin i in 2010, and $$P_i$$ and $$P_j$$ refer to the push factor at the origin and the pull factor at the destination, respectively (Table S1). Moreover, CGTSIMs with additional CDR variables were tested to assess how much the CDR-derived migration data could improve the performance of gravity models.
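To make Eq. (3) concrete, the sketch below evaluates its right-hand side for one origin-destination pair and converts the resulting share into a flow by multiplying by the origin population. The function names and argument order are assumptions for illustration, and the coefficients would come from a fitted model rather than being chosen by hand.

```python
import math


def migration_share(b0, b1, b2, b3, push_i, pull_j, dist_ij):
    """Right-hand side of Eq. (3): expected share of origin i moving to j."""
    eta = b0 + b1 * push_i + b2 * pull_j - b3 * dist_ij
    # 1 / (1 + e^-eta) is algebraically equal to e^eta / (1 + e^eta).
    return 1.0 / (1.0 + math.exp(-eta))


def predicted_flow(share, tot_i):
    """Expected number of migrants: share of the origin times its population."""
    return share * tot_i
```

Because distance enters with a negative sign, a positive `b3` lowers the predicted share as `dist_ij` grows, all else being equal.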

### Model comparisons

We fitted each model to the census statistics and used a leave-one-out cross-validation approach (Hastie et al. 2009) to calculate goodness-of-fit indicators, including root-mean-square error (RMSE), R-squared (R2), and the Akaike Information Criterion (AIC). The model with the lowest RMSE was taken as the best model of each model family. The estimates of migration between regions were then calculated using the optimal model, and the inflow, outflow, and net flow for each region in Namibia were also aggregated.
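The two steps in this paragraph, leave-one-out validation and regional aggregation, can be sketched in Python as follows. This is a simplified illustration for a one-predictor linear model, with hypothetical function names; the study itself used the caret package in R, and the definition of net flow as inflow minus outflow is an assumption.

```python
import math
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1


def loocv_rmse(x, y):
    """Refit with each observation held out and pool the prediction errors."""
    sq_errors = []
    for k in range(len(x)):
        b0, b1 = fit_line(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        sq_errors.append((y[k] - (b0 + b1 * x[k])) ** 2)
    return math.sqrt(mean(sq_errors))


def aggregate_by_region(flows):
    """Sum origin-destination flows into outflow, inflow, and net flow."""
    totals = defaultdict(lambda: {"out": 0, "in": 0, "net": 0})
    for (origin, destination), n in flows.items():
        totals[origin]["out"] += n
        totals[destination]["in"] += n
    for t in totals.values():
        t["net"] = t["in"] - t["out"]  # assumed sign convention
    return dict(totals)
```

`loocv_rmse` expects `x` and `y` as Python lists, since it builds each training set by list slicing.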

As our models used non-spatial regression approaches, and spatial autocorrelation may exist in migration data (Tobler, 1970; Getis, 2008; Sorichetta et al. 2016), a shuffle test was used to assess whether any spatial dependencies significantly affected the performance of our models. First, we randomly permuted the census-derived migration data across all regions. Each model was then fitted to the shuffled dependent variable to calculate an RMSE, and the distribution of RMSE was produced through 1000 iterations. If the “real” RMSE of a model fitted with the “ground truth” migration data was less than all 1000 simulated values of RMSE from the shuffled data, we assumed that spatial dependencies were not significant in our models. All analyses were done within the R statistical environment (version 3.5.2), and the model fitting procedures were conducted using the caret package (Kuhn, 2008; R Core Team, 2018).
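The shuffle test described above can be sketched as a permutation procedure. The Python example below is a simplified illustration, not the authors' R code: it uses a one-predictor linear model and in-sample RMSE, and the function names and the fixed random seed are assumptions.

```python
import math
import random
from statistics import mean


def in_sample_rmse(x, y):
    """RMSE of a least-squares line y = b0 + b1 * x on its own training data."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return math.sqrt(mean([(b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y)]))


def shuffle_test(x, y, n_iter=1000, seed=0):
    """Return the real RMSE and the RMSEs obtained with permuted responses."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    real = in_sample_rmse(x, y)
    y_perm = list(y)
    simulated = []
    for _ in range(n_iter):
        rng.shuffle(y_perm)  # break the pairing between x and y
        simulated.append(in_sample_rmse(x, y_perm))
    return real, simulated
```

Under the decision rule in the text, the check would be whether `real` is smaller than every value in `simulated`.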

### Estimating migration for the 2012 period

Because no census migration statistics were available for fitting models in Period 2012, the CDRLM using only CDR data, with its coefficients fitted for Period 2011, was used to predict migration for Period 2012 and to compare the pattern of migration across periods. Moreover, to account for the growing number of mobile phone users since 2011, the CDR-derived data for migrating users in Period 2012 were inversely weighted by the rate of increase in mobile phone users for each region, to offset the potential bias introduced by increasing mobile ownership across periods.
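One way to read the adjustment above is sketched below in Python. The text does not state whether the growth rate of the origin or the destination region was applied, or exactly how the rate was defined, so the use of the origin region's ratio of user counts is an assumption, as are the function names.

```python
def adjust_for_user_growth(flows_2012, users_2011, users_2012):
    """Scale Period 2012 flows down by the growth in users at the origin.

    Dividing by the ratio users_2012 / users_2011 of the origin region is an
    assumed reading of "inversely weighted by the increasing rate".
    """
    adjusted = {}
    for (origin, destination), n in flows_2012.items():
        adjusted[(origin, destination)] = n * users_2011[origin] / users_2012[origin]
    return adjusted


def predict_census_flows(adjusted_flows, b0, b1):
    """Apply coefficients fitted on Period 2011 to adjusted Period 2012 flows."""
    return {pair: b0 + b1 * n for pair, n in adjusted_flows.items()}
```

Here `b0` and `b1` stand for the intercept and slope of the CDR-only linear model fitted for Period 2011.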

### Mobile phone ownership analysis

As mobile phone users only represent a proportion of the whole population, we utilized data from the 2013 Namibia DHS (The Namibia Ministry of Health et al. 2014) to assess the extent to which there is a possible exclusion of certain groups at a household level within the CDRs in the context of Namibia (SI text). To account for potential mobile phone ownership biases across regions, the models mentioned above were also tested by using CDR data adjusted by two approaches respectively: (1) using the proportion of mobile phone ownership to inversely weight CDR-derived migration data by region; and (2) adding the proportion of ownership as an additional variable into models.

## Results

### Correlations between census-derived and CDR-derived migrations

According to the 2011 Namibia population and housing census (Namibia Statistics Agency, 2015), a total of 40,867 Namibians (2.0% of 2,013,671 people) migrated by changing their places of residence between regions in Namibia over the one-year period prior to the census in August 2011, with the highest migration into Khomas, the capital region of Namibia, and the highest migration out of the Zambezi region in the northeast of Namibia (see Figs 1 and S1). Based on the anonymized CDRs in Namibia between October 2010 and April 2014, we estimated the number of migrating mobile users by comparing their residences between the two years of October 2010–September 2011 and October 2011–September 2012 (Period 2011) (SI text; Figs S2 and S3). A high correlation (Pearson's coefficient, r = 0.91) was found between the census-derived population and the number of mobile phone users included in Period 2011 (Fig. 2a). Furthermore, the migration flows were also highly correlated (r = 0.84) between the census data and the 117,173 CDR-derived migrating mobile users (11.2% of 1,049,379 users) (Figs S4 and S5).

Substantial differences in the Zambezi region were observed when comparing the census and CDR data, with more migrants in the census than in the CDRs (Figs S5 and S6). The Zambezi region lost a significant proportion of its population (5.5%), which was attributed to displacement due to floods in April–June 2010, outside the time frame of the census and CDRs (International Federation of Red Cross and Red Crescent Societies, 2011; Namibia Statistics Agency, 2015). According to the definitions used in the census (SI text) (Namibia Statistics Agency, 2015), if people moved to the places of displacement before September 2010 and still lived in the same places at the time of the census, they should be considered non-migrants. Therefore, the populations displaced from Zambezi before September 2010 may well have been misclassified as migrants in the census. Moreover, based on the CDR-derived monthly residence data, the inflow and outflow of Zambezi appear to be seasonal, without aberrantly high movements, from October 2010 to April 2014 (Fig. S7). After removing the data from Zambezi, the relationship between census-derived and CDR-derived migration data improved markedly, with the r value increasing from 0.84 to 0.96. Therefore, we present the following results without the Zambezi region, and comparable analyses for all regions are provided in the SI.

### Comparing migration prediction models

In general, the goodness-of-fit indicators, including RMSE, R2, and AIC, show that CDRLMs using only CDR data could precisely and accurately replicate census-derived statistics, with a better predictability than GTSIMs (Figs S8–S10). Moreover, the performance of GTSIMs could be substantially improved by using CDRs. Comparing the “real” RMSE with the distributions of RMSEs generated by the shuffled census data, it was evident that spatial autocorrelation was not significant in our models (Fig. S11). According to the optimized model with the lowest RMSE, all three families of models could capture the patterns of migration flows between regions (Fig. S12), but CDRLMs had a higher accuracy in predictions compared with GTSIMs and CGTSIMs (Fig. 3). Additionally, in terms of outflow, inflow, and net migration aggregated by region, the estimates from CDRLM were highly correlated (R2 = 0.97, 0.97, and 0.94 respectively) with the census-derived data (Figs 4 and S13).

### Mobile phone ownership bias and model adjustment

As mobile users only represent a proportion of the population, to understand the potential phone ownership bias, we utilized data from the 2013 Namibia Demographic and Health Survey (DHS) (The Namibia Ministry of Health et al. 2014) to assess the extent to which groups with specific characteristics may be excluded from CDRs in Namibia. The 2013 DHS reported that although the large majority (88.5%) of households interviewed owned at least one mobile phone (The Namibia Ministry of Health et al. 2014), lower-income and rural households with older and uneducated heads were less likely to be able to afford a cell phone, and there was a significant ownership differential between regions in Namibia (SI text; Tables S2 and S3). To account for the potential mobile ownership bias between regions, two approaches were used separately to adjust the CDRs. However, the performance of both CDRLMs and CGTSIMs was not significantly improved by these adjustments (Figs S8–S10).

### Predicting migration in 2012

The multiannual time series of CDRs in Namibia allows us to assess their potential to be used to update intercensal national statistics and understand the changing patterns of internal migration across years. By comparing the places of residence between the two years of October 2011–September 2012 and October 2012–September 2013 (hereafter called Period 2012), we captured 144,064 migrants among 1,238,124 mobile users, a proportion (11.6%) similar to that in Period 2011. The increase in the number of migrations between periods was likely due to the increasing penetration rate of mobile phones across years (Figs S4 and S14). To compare migration patterns between the two periods, we adjusted the number of CDR-derived migrating users in Period 2012 by region to offset the increasing mobile phone ownership across periods. Then, the simplest CDRLM using only CDR data, with its coefficients estimated for Period 2011, was used to predict migration for Period 2012 using the corresponding adjusted CDR data (Fig. S14). We observed highly consistent patterns of migration flows between Periods 2011 and 2012, as well as in the outflows, inflows, and net migration aggregated by region (Figs S14–S16). However, the relative differences across periods show greater variations in outflow than in inflow between regions, with more people moving out from the West-South regions and into the northern regions in Namibia (see Fig. 5).

## Discussion

Migration is difficult to measure frequently, particularly at local scales, and data from censuses are typically collected just once every decade, creating a need for innovation in the production of migration statistics (International Organization for Migration, 2018). The penetration rate of mobile phones is now high across the globe, and analyzing the changing spatiotemporal distribution of mobile phone users through anonymized CDRs offers the possibility to measure migration at multiple temporal and spatial scales. Global mobile phone network subscriber numbers passed the five billion mark in 2017 with a global penetration rate of 66%, and the rate is forecast to continue to grow, rising to 71% by 2025, with rapid recent increases in ownership in low-income countries (The GSM Association, 2018). The data collected every second by mobile network operators have the potential to contribute to the “big data revolution” in complementing more traditional statistics through updating internal migration statistics in a timely, accurate and low-cost way.

This study demonstrates how the analysis of CDRs can replicate national internal migration statistics to complement outputs from censuses. The multiannual time series of CDRs with high spatiotemporal resolution facilitates the derivation of residence measures, matching closely the definitions used in censuses. We found that not only can the estimates of migration produced through CDRs be as accurate as census data-derived measures, but these data offer additional benefits in terms of updating intercensal migration numbers and understanding changing patterns of annual internal migration. Additionally, the methodologies presented are designed to be easy to implement while considering the impact of heterogeneous phone ownership across regions and years, and the simple linear model built using CDRs results in estimates with high precision and accuracy.

Results here suggest that CDRs can also improve the performance of gravity models. The GTSIMs explicitly state the spatial interaction relationship between migration and the push-pull factors that represent the benefits and costs of migration (Zipf, 1946; Hua and Porell, 1979). The estimates made using gravity models contribute to a better understanding of migration patterns, with known boundaries to their accuracy in the absence of censuses or surveys. However, due to the lack of high spatiotemporal resolution input data on contemporary population movements, such models used in previous studies resulted in high uncertainties in estimates (Garcia et al. 2015; Sorichetta et al. 2016; Vobruba et al. 2016). Though biases exist, as CDR-derived migration data directly relate to populations who moved across the country over years, a combination of CDRs and other migration-related covariates could facilitate a significant improvement in the precision and accuracy of outputs from gravity models.

Internal migration is common in Namibia, and we estimated a larger number of migrating mobile phone users compared with those migrating within the census data. One reason is that CDRs do not suffer from recall bias (Wesolowski et al. 2014b) and capture missing data from people who moved, but did not register their previous residence in the census. Moreover, different time windows for data capture may also have contributed, with the CDR-based home definition window used here being wider than the census collection date. As elsewhere, the largest proportion of migration in Namibia is rural-to-urban migration, a phenomenon that relates partly to rapid urbanization (Garcia et al. 2015; Namibia Statistics Agency, 2015; International Organization for Migration, 2016). However, to accurately derive these migration flows and patterns using CDRs, any impacts from seasonal temporary movement should be minimized, such as holiday-related travel in December, patterns that are highly repetitive in Namibia (Zu Erbach-Schoenberg et al. 2016; Wesolowski et al. 2017). Using a 12-month time frame to define residence of mobile users may prevent bias of residence towards the temporary locations of seasonal travel (SI text). Further, the high temporal resolution of longitudinal CDRs enables the derivation and update of different statistical indicators of migration using varying periods, e.g., 2-year or 3-year migrations.

Some limitations must be acknowledged. First, to prevent overfitting and multicollinearity, our models did not test a large number of demographic, socioeconomic, geographic, and environmental factors and their combinations that might potentially affect migration as described before (Henry et al. 2003; Henry et al. 2004; Garcia et al. 2015; Wesolowski et al. 2015b; Ruktanonchai et al. 2016a; Sorichetta et al. 2016; Vobruba et al. 2016). Another methodological shortcoming is the lack of correction for spatial autocorrelation in the modeling by using a spatial regression model. However, a shuffle approach showed that any spatial dependencies likely did not significantly affect the performance of our models.

Mobile users only cover a proportion of the population; therefore, CDRs may provide an incomplete picture, not accounting for those who do not own and use a phone, mobile phone sharing, network coverage, or alternative networks. The spatiotemporal and demographic variations in the behavior of phone users can also bias population distribution and migration estimates (Lu et al. 2012; Deville et al. 2014). Mobile phone ownership is typically biased toward more educated, urban males (SI text), and mobile network coverage may be substantially lower in remote rural locations (Wesolowski et al. 2017). However, a high proportion of the population in Namibia were SIM card owners who appeared in the CDRs (Stork, 2011), and a high share of ownership at the household level was also found in the 2013 DHS data (The Namibia Ministry of Health et al. 2014). With continuously increasing mobile coverage and declining costs for handsets and network usage, the proportion of people owning and using mobile phones has been steadily increasing (The GSM Association, 2018), which will also reduce the influence of phone sharing, a practice common in areas with low cell phone penetration.

In addition, to account for the impact of increasing user numbers across years on migration estimates, we adjusted the CDR-derived data for comparing interannual migration patterns, but these only represent an initial step for adjusting for mobile phone usage changes. Future studies on estimating migration could use other appropriate data, such as travel history and mobile phone use surveys to infer possible correlation in mobile use and migration in demographic-specific subgroups. In addition, due to the availability of data, we only investigated here internal migration over the course of a year. Long-term internal migration (>5 years) could be estimated by analyzing CDRs over a longer period and these could be integrated with additional data sources, such as Google Location History data (Ruktanonchai et al. 2018), to address relevant underlying research questions and technical issues in the future.

The results here show that estimating migration flows using CDRs is a promising avenue for complementing more traditional national statistics and obtaining more timely and local data. The metrics and approaches can inform distinctly different policy-relevant needs that require migration statistics and the implementation of policies geared towards providing relevant public services. Partnerships between governments and phone companies supported by appropriate incentives could enable accurate and rapid production of national migration statistics to complement census and survey-based data collection.