Main

Every year, tens of millions of individuals and families are forcibly displaced by armed conflict, creating enormous humanitarian, social and economic costs1,2. Recent estimates indicate that the number of people living in proximity to conflict has doubled since 2007 (ref. 3). Despite the global scale and importance of this crisis, empirical analysis of the link between violence and displacement is complicated by the inherent difficulty of observing displaced populations, a difficulty that is compounded in developing countries and in insecure environments3,4.

This paper develops and tests an approach to studying the impact of violence on internal displacement. Our first contribution is methodological and illustrates how high-frequency ‘digital trace’ data can enable different approaches to identifying and estimating the causal effect of violent events on internal displacement. We use a panel event study framework that is feasible only because we observe the locations of millions of individuals near-continuously over time: the high-frequency data allow us to draw causal inferences from discontinuities in spatio-temporal trajectories that coincide with specific violent events. Such an estimation strategy would not be feasible with traditional survey-based data, which track a relatively small number of individuals at infrequent intervals. This complements recent qualitative5,6, survey-based7,8,9 and observational studies10,11,12 on conflict and displacement in the developing world. It also builds on recent work using non-traditional sources of digital data to study the movement of human populations13,14,15,16,17,18,19,20,21,22,23,24.

Our second contribution is to provide rich evidence on the impact of violence on internal displacement in Afghanistan—a country that has experienced decades of conflict and that contains over two million internally displaced people1. We contribute a quantitative perspective on the nature of this displacement, complementing more traditional approaches based on surveys and administrative reporting25,26,27.

Our analysis is based on the universe of mobile phone activity from Afghanistan’s largest mobile phone operator for a four-year period from April 2013 to March 2017. This dataset contains the anonymized mobile phone metadata of approximately ten million mobile subscribers (a non-random subset of all individuals in Afghanistan, as we discuss in greater detail below). We separately obtain geo-coded information on fatal violent events in Afghanistan, which is collected by the Uppsala Conflict Data Program (UCDP) from public media reports28. We use 3,354 events in our analysis, corresponding to the subset of all events that are recorded with sufficient spatial and temporal precision (53% of all recorded events) and which overlap with our phone data; the limitations of the UCDP data are discussed in the ‘Limitations’ section of this paper. The spatial distribution of mobile phone towers and violent events can be seen in Fig. 1 and Supplementary Fig. 1.

Fig. 1: Map of cell towers and violent events in Afghanistan, 2013–2017.

District boundaries are represented by grey lines. There are 1,439 cell tower groups (black dots) and 5,984 violent events (red dots; ref. 75), 3,354 of which occur in districts on days with mobile phone activity. Event locations are marked by exact geocoordinates when available, or with the district centroid when only the district name is known.

The empirical approach we develop uses the mobile phone data to observe the movement of subscribers between regions, and a statistical model to estimate the causal effect of violence on this movement (see Methods for a detailed description). We first use metadata on the sequence of cell towers used by each mobile subscriber to identify each subscriber’s home district (the smallest administrative unit in Afghanistan). We then adapt recent algorithmic advances in the measurement of migration from digital trace data29 to identify days on which the subscriber’s home district changes—we refer to this as a ‘migration’. (The International Organization for Migration (IOM) defines migration as “the movement of persons away from their place of usual residence, either across an international border or within a State”30. Our main analysis focuses on migrations that involve a person leaving their origin district for at least a week; we validate our empirical measures of migration with IOM statistics in the Methods, section ‘Data validation’ and Supplementary Fig. 2.) These individual migration events are then aggregated to produce estimates of the total population flows between each pair of districts on each day. Finally, we use a high-frequency panel event study design to estimate the causal effect of violence on migration, which we measure as the increase in movement of subscribers out of a district impacted by violence, relative to movement from the same district on non-violent days, while controlling for seasonality and other temporal factors. Specifically, we regress total out-migration from each district on each day on a vector of binary indicators for whether violence occurred in that district on that day, in the 180 days prior (the ‘lags’) or the 30 days after (the ‘leads’) that day. We control for district and day fixed effects and cluster standard errors at the district level. 
This approach allows us to estimate the ‘average displacement effect’ of violence—averaged over the 3,354 violent events in our dataset—relative to movement that occurs in the absence of violence. This modelling approach and its identifying assumptions are described in detail in the Methods (section ‘Panel regressions: measuring k-day displacement’).
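
As a rough illustration of the design described above, the following sketch builds the binary lag and lead indicators for a single district-day. It is a minimal sketch under stated assumptions, not the authors' code: the function name `event_study_indicators`, the sign convention (positive τ means days since violence) and the toy series are hypothetical, and days outside the observed series are simply coded as zero. The default windows of 180 lags and 30 leads follow the text.

```python
from typing import Dict, List


def event_study_indicators(violence: List[int], day: int,
                           n_lags: int = 180, n_leads: int = 30) -> Dict[int, int]:
    """Return {tau: 0 or 1} for one district-day.

    tau > 0: violence occurred tau days before `day` (a lag).
    tau = 0: violence occurred on `day` itself.
    tau < 0: violence occurs |tau| days after `day` (a lead).
    Days outside the observed series are coded 0 (an assumption).
    """
    indicators = {}
    for tau in range(-n_leads, n_lags + 1):
        t = day - tau  # calendar index of the day being checked for violence
        indicators[tau] = violence[t] if 0 <= t < len(violence) else 0
    return indicators


# Toy district series: ten days, with violence on day 4 only.
violence = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
row = event_study_indicators(violence, day=6, n_lags=3, n_leads=2)
print(row)
```

In a full analysis, one such row per district-day would enter a regression of out-migration with district and day fixed effects; that estimation step is not shown here.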

We frequently refer to this excess migration caused by violence as ‘displacement’, though we acknowledge that this statistical notion of displacement is different from that used by international organizations. For instance, the IOM defines displacement as “the movement of persons who have been forced or obliged to flee or to leave their homes or places of habitual residence, in particular as a result of or in order to avoid the effects of armed conflict, situations of generalized violence, violations of human rights or natural or human-made disasters.”30

The results presented below illustrate the additional perspective on internal displacement that can be achieved through the analysis of population-scale digital trace data. However, this approach has important limitations. Some of these we can address (as discussed in the Methods), but others are more fundamental21,31,32,33,34. We discuss a few key limitations—such as issues of population representativity, data access and privacy—in the ‘Limitations’ section, after describing the empirical findings.

Results

The aggregate effect of violence on displacement

Violence in Afghanistan causes internal displacement (Fig. 2). Among subscribers who were present on the day of a violent event, there is an immediate and statistically significant increase in the likelihood of leaving the district (Fig. 2a). This increase peaks roughly ten days after violence, when the odds of being observed in a different district are 4.0% higher than in the absence of violence (95% confidence interval (CI), (2.6%, 5.5%); P < 0.001; this and all subsequent reported statistical tests are two-tailed tests). Violence-induced displacement is persistent: even 120 days after the violent event, subscribers who were present for the event are roughly 2% (95% CI, (0.5%, 2.9%); P = 0.007) more likely to still be outside of the district. We show that these results are robust to several potential data and modelling issues in the Methods, section ‘Impact of a violent day’.

Fig. 2: The effect of violence on internal displacement.

a, Displacement for those present on the day of the violent event. Exponentiated regression coefficients, plotted on the y axis, indicate the increase in the odds that individuals who were in the district on the day of violence are in a different district k days later. The bars indicate 95% CIs. The estimates are based on 3,354 violent events. τ indicates the number of lags for treatment variables in the regression model (Methods), representing the number of days since violence. The full results are provided in Supplementary Table 2. b, Thirty-day displacement. Exponentiated regression coefficients indicate the increase in the odds that individuals are in a different district than they were 30 days prior. Negative x-axis values correspond to days preceding violence. The bars indicate 95% CIs. The estimates are based on 3,354 violent events. τ represents days since violence. The full results are provided in Supplementary Table 3.

We also find evidence that displacement often anticipates violence. This is apparent in Fig. 2b, which shows the increase in the odds of an individual being in a different district from their location 30 days prior (Methods, section ‘Panel regressions: measuring k-day displacement’). Roughly five days before violent events are reported in the media, subscribers start to leave the impacted regions. In the Discussion, we consider several possible explanations for this unexpected result.

What types of violence cause the most displacement?

The aggregate effects shown in Fig. 2 mask substantial heterogeneity in how different types of violence cause different patterns of displacement. Most notably, violence involving the Islamic State (IS), while less frequent, causes significantly more displacement than violence involving the Taliban (Fig. 3a). The difference is the most pronounced in the immediate aftermath of the event, when the increase in displacement is ten percentage points higher for violence involving IS (for subscribers present during the violence, events involving IS increase the odds of displacement a day after by 12.7% (95% CI, (5.7%, 20.2%); P < 0.001); Taliban, 1.9% (95% CI, (−0.4%, 4.3%); P = 0.111); Taliban–IS difference in coefficient estimate, 0.10 (95% CI, (0.04, 0.17); P = 0.002)). Such evidence is consistent with the fact that IS attacks are particularly brutal and frequently target civilians; the Taliban have condemned IS attacks as ‘heinous’35,36.
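
The percentages and the coefficient difference reported above can be related with a short calculation. The sketch below assumes that each reported percentage is exp(coefficient) − 1 on the odds scale, which is how the figure captions describe the exponentiated coefficients; this reading is an inference from the text, and the helper names are hypothetical.

```python
import math


def pct_change_in_odds(beta: float) -> float:
    """Convert a log-odds coefficient to a percentage change in odds."""
    return (math.exp(beta) - 1.0) * 100.0


def log_odds_coefficient(pct: float) -> float:
    """Invert the conversion: percentage change in odds -> coefficient."""
    return math.log(1.0 + pct / 100.0)


# Percentages reported in the text for displacement one day after violence.
beta_is = log_odds_coefficient(12.7)
beta_taliban = log_odds_coefficient(1.9)
difference = beta_is - beta_taliban
print(round(beta_is, 3), round(beta_taliban, 3), round(difference, 2))
```

Under this assumption, the difference in coefficients comes out at about 0.10, in line with the reported Taliban–IS difference.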

Fig. 3: The effect of violence on displacement, disaggregated by type and location.

a–d, The impact of a violent day on individuals who were in the district on the violent day. The coefficients are estimated separately for each panel. Panel a shows displacement for IS (N = 185) versus Taliban violence (N = 3,134); the full results are provided in Supplementary Table 4. Panel b shows displacement for events with 11 or more casualties (N = 397) versus 10 or fewer casualties (N = 2,957); the full results are provided in Supplementary Table 5. Panel c shows displacement for events following fewer than 60 days of peace (N = 2,319) versus 60 or more days (N = 1,035); the full results are provided in Supplementary Table 6. Panel d shows displacement for events in provincial capitals (N = 2,460) versus non-capitals (N = 894); the full results are provided in Supplementary Table 7.

High-casualty and high-frequency violence also have larger effects on displacement. Figure 3b shows that the highest-casualty events (11 or more casualties, roughly equivalent to the 10% of events with the most casualties) cause more displacement at all periods following the event, relative to lower-casualty events (P < 0.001 for a paired t-test of the 120 values shown in the figure; mean difference, 0.034; 95% CI, (0.031, 0.036)). Figure 3c indicates that violent events in regions that have recently experienced violence lead to more displacement than violence in regions that have experienced a period of relative peace (P < 0.001 for a paired t-test; mean difference, 0.030; 95% CI, (0.029, 0.031)). This result is perhaps surprising given prior work suggesting that people in chronically violent areas may acclimate to conflict37,38,39. However, in the context of the prolonged and pervasive Afghan conflict, we take this as evidence that people are more likely to flee their homes after recent and sustained violence but may be willing to withstand ‘idiosyncratic’ violence.
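
The paired comparisons reported here pair the two coefficient series day by day. The following is a minimal sketch of that calculation using only the Python standard library. The function name `paired_t` and the toy values are hypothetical and are not the paper's estimates. The sketch returns the mean difference and t statistic but no P value, since the standard library has no t distribution.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length series with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / se, len(diffs) - 1


# Toy coefficient series for four post-event days (illustrative values only).
high_casualty = [0.06, 0.05, 0.07, 0.04]
low_casualty = [0.02, 0.02, 0.03, 0.01]
mean_diff, t_stat, dof = paired_t(high_casualty, low_casualty)
print(round(mean_diff, 3), round(t_stat, 2), dof)
```

With 120 daily values per series, as in the figure, the same function would give 119 degrees of freedom.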

We also find that the displacement response depends on whether the violence occurs in a provincial capital (which is typically an urban or peri-urban area) or a more rural non-capital district (Fig. 3d). Although provincial capitals appear relatively resilient to violence, with muted and largely insignificant increases in displacement, the effects of violence outside of provincial capitals are large and persistent (P < 0.001 for a paired t-test; mean difference, 0.026; 95% CI, (0.024, 0.028)). Afghanistan’s 34 provincial capitals are regional seats of government and are where the Afghan National Security Forces are typically concentrated40. As a result, they may be seen to offer relative safety compared with outlying districts. We also find suggestive evidence that displacement effects are the highest in areas that are under insurgent control or contested by the Taliban (Supplementary Fig. 3). We do not emphasize these results, since our data on insurgent control are not ideally suited for this analysis (‘Limitations’).

The heterogeneous displacement responses shown in Fig. 3a–d highlight how each separate characteristic of violent events (the parties involved, the severity and recency of violence, and the location of the attack) relates to subsequent displacement. However, some of these characteristics are correlated: for instance, violence in provincial capitals tends to be preceded by fewer days of peace. For this reason, Fig. 4 shows the joint relationship between these characteristics and displacement; that is, it shows how each factor correlates with displacement, holding the other factors fixed (using a regression model; Methods). Here we observe that the general patterns are consistent with the earlier results, but the outsized impact of IS is made clear: all else equal, violence involving IS has the largest and most pronounced impact on short- and long-term displacement. (For example, the difference in coefficient estimates for IS events versus events following recent violence, for the model involving the average effect of violence 1–15 days in the past, is 0.194 (95% CI, (0.032, 0.356); P = 0.019).)

Fig. 4: How event and location characteristics affect the impact of each event on 30-day displacement (N = 2,359).

The different colours denote different outcome variables: the average effect of violence when it occurred 1–15, 16–30, 31–45, 46–60, 61–75 and 76–90 days in the past. The bars indicate 95% CIs. The full results are provided in Supplementary Table 8.

Where do the displaced go?

The analysis of anonymized mobile phone metadata can also provide granular insight into the destinations of the displaced. To build intuition, Fig. 5 shows the flow of migrants in Afghanistan during normal times—that is, on days when violence does not occur. On such non-violent days, the total volume of mobile subscribers leaving capitals and non-capitals is approximately equal. For those moving from capitals, 73.5% move to a different province, and roughly half (47.0%) move to other capitals or major cities; of the subscribers leaving non-capitals, half of them (50.6%) move to another province, and 30.0% move to the provincial capital in the same province.

Fig. 5: Migrant flows on days without violence.

The proportion of subscribers moving between locations of different types, where a move is defined as a change of home district over a 30-day period.

More revealing is how the equilibrium pattern of displacement shifts in response to violence. These results are summarized in Fig. 6, which shows how the odds of moving to each type of district change on days with violence. Violence affecting non-capital regions (left) makes subscribers more likely to go to the provincial capital of their origin district and less likely to go to the largest cities (Kabul, Kandahar, Hirat, Mazari Sharif and Jalalabad) outside their origin province. By contrast, violence affecting capital regions (right) tends to drive subscribers away from their home province and to either the five major cities or more rural non-capitals. We discuss these and related results below.

Fig. 6: The effect of violence on destination choice.

How movement from districts on days of violence differs from movement from districts on days without violence, where a move is defined as being in a different district 7, 30 or 90 days after the reference date (Methods). Left, movement from non-capitals; right, movement from capitals. Red denotes movement between provinces, and green denotes movement within a province. Destinations are divided by the type of district: one of the five largest cities (Kabul, Kandahar, Hirat, Mazari Sharif and Jalalabad), another of the provincial capital districts or a non-capital district. The bars indicate 95% CIs. The estimates are based on up to 894 violent days in non-capitals and 2,460 in provincial capitals. The full results are provided in Supplementary Table 9.
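
The destination categories used in this figure can be made concrete with a small classification sketch. It is illustrative only: the `District` record, the function name and the example origin district are hypothetical, and matching major cities by district name is an assumption about how districts might be labelled in a dataset.

```python
from dataclasses import dataclass
from typing import Tuple

# The five largest cities named in the text.
MAJOR_CITIES = {"Kabul", "Kandahar", "Hirat", "Mazari Sharif", "Jalalabad"}


@dataclass(frozen=True)
class District:
    name: str
    province: str
    is_capital: bool  # True if the district is a provincial capital


def destination_category(origin: District, dest: District) -> Tuple[str, str]:
    """Classify a move by scope (within or between provinces) and destination type."""
    scope = "within-province" if origin.province == dest.province else "between-province"
    if dest.name in MAJOR_CITIES:
        kind = "major city"
    elif dest.is_capital:
        kind = "provincial capital"
    else:
        kind = "non-capital"
    return scope, kind


# Hypothetical move from a rural district to the capital of the same province.
origin = District("Example rural district", "Nangarhar", False)
dest = District("Jalalabad", "Nangarhar", True)
print(destination_category(origin, dest))
```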

Discussion

The preceding results illustrate a granular and dynamic empirical approach to studying conflict-induced displacement. The aggregate finding that violence causes displacement is consistent with prior work on internal displacement in fragile and conflict-affected countries7,8,12. In Afghanistan (the context of our analysis), IOM survey data from 2019 suggest that conflict had caused roughly two thirds of all current displacement25.

The main innovation of this approach is that it permits a fine-grained, quantitative analysis of violence-induced displacement that would be difficult to accomplish using household surveys and qualitative interviews. For instance, our analysis documents the important role of provincial capitals in influencing both the impact of violence and the destinations of displaced people. One specific finding is that when violence occurs in non-capitals, subscribers tend to flee to their home provincial capital. Such evidence corroborates qualitative findings that “people displaced by conflict and violence tended to try to stay as close as possible to their homes, moving from rural areas to the provincial capital or a neighbouring province”41. The attraction of provincial capitals is probably due to several factors. First, capitals have a higher concentration of government security forces, and in the wake of violence, physical security is probably a crucial consideration. Relatedly, humanitarian aid—whose effectiveness depends on the security of the location42—is often most easily accessed in provincial capitals. Furthermore, provincial capitals are often the most urbanized area in the region, potentially offering greater economic opportunities43. Finally, movement to provincial capitals may create a feedback loop, where families are likely to have connections in provincial capitals, thus encouraging further movement to these capitals26,44,45.

We also find that the types of violence associated with more displacement include violence related to IS, high-casualty violence and chronic violence. These results can be explained by the level of risk perceived by individuals and households influencing their decision to flee7,8. Apart from indiscriminate attacks, IS has orchestrated vivid displays of violence, such as filmed executions, advertising its brutality and intimidating opponents46. The Taliban, by contrast, has support among some segments of society and is seen by some as a legitimate governing force47. Separately, it is plausible that high-casualty and chronic violence similarly create an atmosphere of fear that leads to higher displacement. Casualties have been found to be associated with insurgent recruitment and violence48; their link to displacement is perhaps unsurprising.

Our analysis also indicates that people appear to anticipate the occurrence of violence, leaving before it occurs—a finding that relates to prior work on the predictability of violence and conflict49,50,51,52, including recent work using mobile phone data53. This anticipatory effect is most pronounced with recently experienced violence (Supplementary Fig. 4). There are several possible related explanations for the anticipatory response we observe in Afghanistan. First, both NATO forces and the Taliban frequently warned civilians prior to major operations—for example, by distributing leaflets54 or ‘night letters’55. Relatedly, it may be that people were not anticipating a specific event but were rather responding to a general period of unrest; individuals might perceive a threat of violence before a recorded event actually takes place. (Note that this does not violate the causal identification assumption of no unobserved time-varying confounders, unless the perceived threat of violence causes violence to occur.) For example, there might be skirmishes between armed forces that do not lead to fatalities or are not reported in the media. This is consistent with survey-based evidence, which finds that the perceived threat of violence or presence of armed forces is sufficient to cause displacement, independent of the actual exposure to violence7,8,9. Rumours and word-of-mouth may also play a role: prior work suggests that information about violence spreads quickly56,57—including unverified rumours58—and that rumours may prompt people to take action to protect themselves59.

Limitations

While passively collected mobile phone data create new possibilities for understanding forced displacement, there are important limitations and considerations regarding the use of such data (for more systematic reviews, see refs. 21,31,32,33,34). As noted, our measures of migration and displacement only cover interdistrict movement of phones within the country of Afghanistan, and thus probably underestimate the overall effect of violence on displacement, which includes international and within-district movements.

More generally, there are several issues related to the representativity of the data and the potential biases that may result. Our data reflect the displacement patterns of mobile phone owners who have an active account on one specific commercial network, and not the full Afghan population. While mobility inferences on mobile phone owners have been shown to correlate with the mobility of non-owners in certain contexts60, it is possible that violence induces different types of displacement among phone owners and non-owners. Specifically, the most vulnerable populations, such as women, children and lower-income individuals, tend to be underrepresented in mobile phone data, as they are less likely to own a phone61,62. Relatedly, the data indicate only the intermittent and approximate locations of mobile phones, not of actual individuals. When phones are shared, powered off or disused, this can introduce measurement error. Vulnerable populations may use their phones less often, resulting in larger errors for these individuals63. We address some of these issues through data processing, estimation methods and robustness checks (Methods and Supplementary Fig. 5) but cannot eliminate all such concerns.

Another obstacle to the widespread use of these methods of studying internal displacement is that mobile phone data are not always easily accessible and may require partnerships with data owners. However, there are encouraging signs that private companies (such as Facebook, Google and Safegraph64,65) may be interested in making mobility data available to researchers and humanitarian organizations31,66,67. Finally, the analysis of phone data must respect the privacy of individual subscribers, particularly when dealing with displaced and otherwise vulnerable populations68,69. Our analysis involves only anonymized data that are aggregated geographically (by district) and temporally (by day). More generally, de Montjoye et al.70, Mayer et al.71 and others provide broader frameworks for the privacy-conscientious use of mobile phone data.

We also note limitations of the data we use to measure violence and conflict. We rely on data collected by the UCDP, which has known limitations that we discuss below. We also considered several alternative sources of data on violence, including the Armed Conflict Location and Event Data Project (ACLED)72, the Global Terrorism Database (GTD)73 and data on Significant Activities (SIGACTS)74. However, these alternative data sources either do not have full data available during our period of study (ACLED and SIGACTS) or focus on specific types of violence (that is, GTD focuses on terrorism). Thus, the UCDP data are the most suitable for our purposes, but they have their own limitations, in part because they are collected from media sources (media sources are supplemented by reports from non-governmental and intergovernmental organizations, field reports and books75). In particular, not all events are included in media reports; inclusion is affected by a large number of factors76. For example, more populous regions and economic centres have a larger media presence and are thus better covered than peripheral areas, while contested areas and places with less infrastructure tend to have poorer coverage. Media coverage is also influenced by how ‘newsworthy’ the event is likely to be77. Our analysis is further restricted to those events for which the date and location (that is, the district) are known. Since the spatial and temporal precision of event reporting is not random78, and that non-randomness may even differ across regions of Afghanistan, this could introduce bias into our analysis. For instance, if the threshold for reporting violence in rural areas is higher than the threshold in urban areas76,77, that could bias our analysis to finding larger effects of violence in rural areas. 
We perform robustness tests in the Methods (‘Implications of missing violence data’) to assess the likely magnitude of this bias, but we acknowledge that this is a limitation of our data—and a limitation shared by most empirical work on conflict. While these problems of exclusion are faced by all violent events datasets that rely on news sources, and there is little we can do to improve the quality of the data, we are optimistic that the quality of conflict event data will improve as they gain more widespread use. For instance, several papers now articulate best practices for the collection of conflict data79,80, and researchers are increasingly exploring how social media and other non-traditional data can be used to source conflict data81.

Our analysis is further constrained by the general scarcity of publicly available quantitative data covering Afghanistan during this period. For instance, Supplementary Fig. 3 provides suggestive evidence that displacement effects are greatest when violence occurs in areas controlled or contested by the Taliban—a finding that is supported by prior work on the dynamics of civil wars82. However, we consider such results preliminary, because they are based on a cross-sectional assessment of territorial control conducted in October 2017. Since territorial control changed rapidly during the period we study, a more robust analysis requires panel data on territorial control over time.

A broader limitation of our quantitative approach is that we cannot say much about what specific aspects of violence people are reacting to and whether observed displacement is voluntary or not. In aggregating effects over all violent events, we find that, on average, violence causes displacement. This average may nonetheless mask types of violence that reduce mobility (for instance, if populations are trapped or involuntarily immobile83), provided that the reduction is offset by other types of violence that dramatically increase out-migration. Relatedly, in discussing the anticipatory response, we noted that recorded (fatal) events may be preceded by a broader context of non-fatal violence, and individuals may in fact be responding to broader changes in the local environment that we cannot directly observe in our data.

Finally, while we have taken several steps to improve the internal validity of our analysis, we acknowledge that we can only speculate about the extent to which our findings will generalize to other countries. The general result that displacement is affected by violence has been documented in other contexts, including Colombia8, Kosovo10, Nepal7, Somalia11, Syria9 and internationally12. However, the Afghan setting is unique in many respects, and some results (such as the magnitude of the estimated effect or specific elements of the heterogeneous response) may be specific to Afghanistan. For example, Afghanistan has a decades-long history of conflict; our results will probably translate better to settings with endemic conflict than ones where violence is new. Afghanistan also has strong clan-based, ethnic and sectarian divisions; a strong patriarchal society; and a relatively poor and young population. These contextual factors probably impact migration decisions in complex ways; our results will naturally be most relevant to environments with similar characteristics.

Conclusion

Despite these important caveats, we remain optimistic that mobile phone and other digital trace data offer great potential for the study of internal displacement. Our empirical analysis provides insight into the nature of violence-induced displacement in Afghanistan and helps quantify some of the human costs of violence that would be difficult to measure using traditional methods. While there are definite limitations to what can be observed through mobile phone data, conflict-prone regions are often also the places where traditional survey-based data are the least reliable and most difficult to obtain. Our hope is that this approach can complement traditional perspectives on displacement and eventually contribute to the design of effective policies for prevention and mitigation.

Methods

Data on violence in Afghanistan

We obtain violent events data from the UCDP, a leading source of data on conflict events72. Specifically, we use UCDP Georeferenced Event Dataset Global version 19.1, available at https://ucdp.uu.se/downloads/. This open-source collection of metadata on armed conflict and organized violence is collected from media reports, so it is likely to be biased towards salient events in more populous regions76,77. The criteria for inclusion of an event are “the incidence of the use of armed force by an organized actor against another organized actor, or against civilians, resulting in at least one direct death in either the best, low or high estimate categories at a specific location and for a specific temporal duration”75. In Afghanistan from 2013 to 2017, 5,984 events were recorded where the event location is known to a district level and the event time is known to a specific day; 4,740 of these events occur during the period April 2013–March 2017, for which we have call detail records (CDR), and 3,354 of those occur in districts on days with mobile phone activity. We discard events that are not recorded with this level of precision (47% of all events; the potential implications of this are discussed in ‘Limitations’). Afghanistan is divided into 398 districts in 34 provinces; our analysis is conducted on a district level.

Mobile phone call detail records

Our analysis of displacement is based on a large dataset of pseudonymized mobile phone data from one of Afghanistan’s largest mobile phone operators. As described in ‘Limitations’, we take precautions to ensure that the analysis of phone data respects the privacy of individual subscribers. In particular, our analysis involves only pseudonymized data that are aggregated geographically (by district) and temporally (by day).

We obtain CDR that provide metadata for every mobile phone call and data packet transfer that occurred on this network from April 2013 to March 2017—a total of roughly 20 billion events. For each such event, we observe a pseudonymized unique identifier for the subscriber (hashed from their phone number), the date and time of the event, and the identifier of the physical mobile phone tower through which the transaction was routed. We also know the exact location of each tower, which allows us to approximately identify each subscriber’s location at the time of the event, to within roughly 500 m in urban areas and roughly 10 km in rural areas.

There are 13,315 active towers during this period, many of which are very close together; we group these towers into 1,439 tower groups by combining towers less than 100 m apart. These cell tower groups are plotted in Fig. 1 and Supplementary Fig. 1. Only districts with cell towers are included in this analysis, though we note that these generally correspond to the more populated districts in Afghanistan (see Fig. 1).
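The grouping of nearby towers can be sketched as single-linkage clustering with a 100 m threshold. This is a minimal illustration only: it assumes towers are given as (latitude, longitude) pairs in degrees and that the 100 m rule is applied transitively (a tower joins a group if it is within 100 m of any member). Both the function names and the transitive merging rule are assumptions, not details taken from the paper.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))

def group_towers(towers, threshold_m=100.0):
    """Return one integer group label per tower (single linkage via union-find)."""
    parent = list(range(len(towers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Merge every pair of towers closer than the threshold (hypothetical rule).
    for i in range(len(towers)):
        for j in range(i + 1, len(towers)):
            if haversine_m(towers[i], towers[j]) < threshold_m:
                parent[find(i)] = find(j)

    # Relabel group roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(towers))]
```

The pairwise loop is quadratic in the number of towers, which is manageable for roughly 13,000 towers; a spatial index would be the natural replacement at larger scale.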

Measuring migration

From the original CDR, we follow a sequence of steps to determine whether and when a migration event occurs. We adopt the IOM’s definition of migration, which is “The movement of persons away from their place of usual residence, either across an international border or within a State”30, and we focus on internal migration (“within a State”), where the place of usual residence is measured to a district-level precision. We capture trips that last approximately a week or more (at least five full days and two travel days). The migration that we measure is therefore an interdistrict movement. Complete details on this process are in the Supplementary Information (in the section on data processing); a brief summary is provided here.

Our first step is to derive a ‘daily modal location’ for each subscriber for each day, which is intended to capture the district in which the subscriber spends the majority of their time on that day. For each individual, we first compute their most commonly used cell tower in each hour. Then, for each 24-hour period from 06:00 to 06:00 the next day, we compute the mode of the hourly modal towers. The towers are then mapped to districts using point-in-polygon assignment. Similar methods have been used and validated in other work13,15,32,84,85,86,87. While several prior studies use night-time hours to infer daily locations, we instead use all hours, which allows us to include more individuals in our analysis. For example, in April 2013, data are available for approximately 31 million individual-days using night-time hours (18:00 to 06:00), while 61 million individual-days are defined when using all hours (06:00 to 06:00). The two approaches are highly correlated: of the 31 million observations available using night-time hours, 89% record the same daily modal districts when computed using all hours. Another common approach in the literature divides the physical terrain into approximate catchment areas of each cell tower, using a Voronoi tessellation (for example, refs. 13,88). Our analysis focuses on slightly larger administrative districts, since many of our violent events are identified at only the district level.
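The two-stage mode described above (hourly modal tower, then the mode of those hourly values over a 06:00 to 06:00 window) can be sketched as follows. This is an illustrative sketch under stated assumptions: the input layout, the function name and the `tower_to_district` lookup (standing in for the point-in-polygon assignment) are hypothetical, and ties are broken by whichever tower was seen first, which the paper does not specify.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def daily_modal_districts(events, tower_to_district):
    """events: iterable of (datetime, tower_id) pairs for ONE subscriber.
    Returns {date: district}, where each 'day' runs from 06:00 to 06:00."""
    # Stage 1: count tower usage within each clock hour.
    hourly = defaultdict(Counter)
    for ts, tower in events:
        hourly[ts.replace(minute=0, second=0, microsecond=0)][tower] += 1

    # Stage 2: take the mode of the hourly modal towers within each shifted day.
    daily = defaultdict(Counter)
    for hour, counts in sorted(hourly.items()):
        modal_tower = counts.most_common(1)[0][0]
        day = (hour - timedelta(hours=6)).date()  # 06:00 starts a new day
        daily[day][modal_tower] += 1

    # Map each day's modal tower to its district (assumed precomputed lookup).
    return {day: tower_to_district[c.most_common(1)[0][0]]
            for day, c in daily.items()}
```

Shifting timestamps back by six hours before taking the calendar date is what assigns activity between midnight and 06:00 to the previous day.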

In Afghanistan, we find that the geographic distribution of daily modal locations of mobile phone subscribers broadly reflects the geographic distribution of the population. In particular, when comparing the number of mobile phone users in each district to the district population as estimated by Afghanistan’s Central Statistical Office, the Pearson’s correlation coefficient is 0.94 (95% CI, (0.92, 0.95); P < 0.001). (We calculate the number of subscribers in each district as the number of subscribers whose daily modal location is assigned to that district, averaged over all days in which the district has a non-zero number of subscribers, for a one-year period. Official estimates are obtained from https://data.humdata.org/dataset/estimated-population-of-afghanistan-2015-2016. On a log scale, the Pearson’s correlation is 0.53 (95% CI, (0.43, 0.61); P < 0.001).)

These daily modal locations tend to be sparse and noisy—for instance, many people do not use their phones on every single day, people may take short trips to nearby (non-residential) locations and so forth. Our second step thus employs an unsupervised scanning algorithm29 to identify contiguous segments in which a subscriber is, with high probability, resident in a single district. This algorithm helps smooth the influence of noise (for instance, long periods when a person is primarily in one location but intermittently visits other locations for one or two days) and missing data (for instance, when a person uses their phone infrequently, but almost exclusively from the same location) and has the advantage of not arbitrarily grouping days into calendar weeks or calendar months.

The third step is to identify migration events using discontinuous breaks in these contiguous segments. The second and third steps use the open-source Python package migration_detector (https://github.com/g-chi/migration_detector), which is specifically designed to infer migration events in transaction log data. The accompanying paper29 validates the use of these methods to measure migration. Tuning parameters are set to identify changes in locations resulting from stays of at least five full days in origin and destination districts. See the Supplementary Information for the full details.

The above procedures allow us to measure migration events from the mobile phone CDR. Many such events are not indicative of ‘displacement’, which the IOM defines as “The movement of persons who have been forced or obliged to flee or to leave their homes or places of habitual residence, in particular as a result of or in order to avoid the effects of armed conflict, situations of generalized violence, violations of human rights or natural or human-made disasters.”30 Given the limited contextual information available in the CDR, we cannot directly observe whether each inferred migration event should be considered a displacement. Instead, as we discuss in ‘Panel regressions: measuring k-day displacement’, we focus our analysis on the increase in out-migrations from a district that appear to be caused by violence in that district.

Data validation

To validate the measures of migration derived from the mobile phone CDR, we compare our derived migration metrics to displacement measures published by the IOM (DTM Afghanistan Districts Round 9 Baseline Assessment, available at https://data.humdata.org/dataset/afghanistan-displacement-data-baseline-assessment-iom-dtm). To our knowledge, there are no official or other published data measuring interdistrict migration as we do; while we try as far as possible to produce analogous measures, the IOM data measure fundamentally different quantities, and we do not expect comparisons to be identical. Generally speaking, we might expect province shares of migration and displacement to be similar if the fraction of displaced people among those who move for any reason is similar across provinces. This might not always be the case—for example, we might expect the capital, Kabul, to have a much smaller share of displaced people. Nevertheless, we make the comparison, as the IOM data are the closest published dataset on internal migration or displacement in Afghanistan.

The IOM collects data at the settlement (village) level through key informant interviews, focus group discussions and direct observation25. They use these data to estimate counts of outgoing and incoming internally displaced persons (IDPs) in assessed settlements over fixed periods. IDPs are categorized into ‘returnee IDPs’, ‘arrival IDPs’ and ‘fled IDPs’. We use the data collected in the year 2016; to our knowledge, this means that these individuals were recorded as being IDPs anytime during the year. We group ‘returnee’ and ‘arrival’ IDPs together as incoming IDPs, treat ‘fled IDPs’ as outgoing IDPs and sum the total numbers of incoming and outgoing IDPs for each province. We then compute each province’s share of the total incoming and outgoing IDPs.

Next, to construct an analogous metric from the CDR, we compare the district locations of each subscriber at the beginning and end of three four-month periods in 2016 (January–April, April–August and August–December, summed to obtain a measure of movement in 2016, since the longest we track subscribers is for 120 days). Since each district could have different cell-phone penetration rates, for each period and each district, we estimate the total number of people who moved in and out of the district by scaling the number of recorded subscribers who moved by \(\frac{{\mathrm{district}}\,{\mathrm{population}}}{{\mathrm{no}}.\,{\mathrm{of}}\,{\mathrm{recorded}}\,{\mathrm{subscribers}}}\), where the district population is as estimated by Afghanistan’s Central Statistical Office (available at https://data.humdata.org/dataset/estimated-population-of-afghanistan-2015-2016). We then aggregate these to the province level for 2016 and compute province shares in a similar manner. Supplementary Fig. 2a shows the share of each province estimated to leave; Spearman’s correlation between CDR and IOM statistics at the province level is ρ = 0.49 (95% CI, (0.20, 0.72); P = 0.004). Supplementary Fig. 2b does the same for incoming individuals, with ρ = 0.56 (95% CI, (0.31, 0.77); P < 0.001).
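The population scaling and province shares described above can be illustrated with a short sketch. The dictionary-based inputs and the function name are assumptions made for illustration; the numbers used in any call would come from the CDR and the official population estimates.

```python
def province_shares(movers, subscribers, population, district_to_province):
    """Scale observed movers to population counts and return province shares.

    movers, subscribers, population: {district: count}
    district_to_province: {district: province}
    """
    totals = {}
    for district, n_moved in movers.items():
        # Inflate subscriber counts by (district population / recorded subscribers).
        scale = population[district] / subscribers[district]
        province = district_to_province[district]
        totals[province] = totals.get(province, 0.0) + n_moved * scale
    grand_total = sum(totals.values())
    return {province: value / grand_total for province, value in totals.items()}
```

Because each province's total is divided by the national total, the shares sum to one and can be compared directly with the analogous IOM shares.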

In Supplementary Fig. 2, we see that many provinces have similar shares of migration and displacement, with some obvious differences in Kabul Province, where migration far exceeds displacement, and Hilmand, where displacement far exceeds migration (Hilmand Province was a Taliban stronghold and frequently saw heavy fighting89).

Panel regressions: measuring k-day displacement

We combine the violent events data and migrations observed in the CDR into a district-day panel dataset, which we use to estimate the ‘average’ impact of violence on out-migration from the district in which violence occurs. We estimate this effect by adapting widely used panel regression models to our context (for example, refs. 90,91,92), which allows us to estimate the total migration caused by violence while controlling for unobserved district- and time-related factors that might influence the occurrence of both violence and migration. We first present the technical details of this model and later discuss the identifying assumptions and possible concerns with this approach.

For each value of k from 1 to 120, we estimate the following regression:

$$g({\mathbb{E}}({Y}_{dt,k}| {X}_{dt},{T}_{d,t+\tau }))={\gamma }_{d}+{\lambda }_{t}+\mathop{\sum }\limits_{\tau =-30}^{180}{\beta }_{\tau }{T}_{d,t+\tau }$$
(1)

where d indexes the district, t indexes the time (calendar date) and covariates Xdt are given by district fixed effects, γd, and time fixed effects, λt. Td,t+τ are the ‘treatment’ variables (whether or not violence occurs) in district d at time t, at a lag of τ days. Lags of τ ∈ [−30, 180] are used, representing violence in the district 30 days in the future to 180 days in the past. This range was chosen because all effects were observed to lie within this window; the results are insensitive to a longer window, while shorter windows are unable to capture all effects of interest. The outcome variable, Ydt,k, is the proportion of those in district d at time t − k that are in a different district at time t. Subscribers present k days ago in district d but with a missing location on day t are included in the denominator but not in the numerator in this computation. The parameter k is introduced to capture the fact that displacement has to be measured relative to some time in the past. g(·) is the logit link function. Since the outcome variable is a proportion, we model it using a beta distribution, a family of continuous distributions in the interval from 0 to 1, taking a variety of possible shapes depending on the values of its parameters. We fit a beta regression using maximum likelihood estimation93. Standard errors are clustered at the district level.

These coefficients can be interpreted as in a logistic regression: for each τ, \({{\mathrm{e}}}^{{\beta }_{\tau }}\) is the multiplicative change in the odds of being in a different district today (time t), for Tτ = 1 (when violence occurs) relative to Tτ = 0 (days without violence), holding the other variables constant. For βτ to be interpreted as the causal effect of violence on displacement, the target parameter is the causal conditional odds ratio, and the necessary identification assumptions are positivity, consistency, conditional exchangeability and correct model specification94. In our context, this specification assumes that there are no spatial spillovers, meaning that violence in one district does not have an effect on displacement in other districts. Carryover effects of the violence are limited to 180 days after the violence, and effects from up to 30 days prior are allowed. These daily effects are estimated independently and do not modify one another. The effects are assumed to be identical for all districts and to not vary over the measurement period (2013–2017). The confounders are limited to district, time and treatment in the surrounding window of time, and these enter additively. This implies that there are no unobserved time-varying confounders and that past outcomes do not affect current treatment (this is plausible since in most cases the number of displaced people is not large enough to affect military strategy). We relax several of these assumptions in subsequent analyses, for instance by allowing for heterogeneous effects of different types of violence in different types of locations.

The key identifying assumption that there are no unobserved time-varying confounders requires the precise day on which a district experiences violence to be random, after conditioning on district and time fixed effects and the occurrence of violence in the surrounding window of time (equation (1)). Qualitatively, we find this assumption plausible because the precise timing (that is, the day on which violence occurs) of insurgent attacks is often meant to surprise government forces. However, the assumption cannot be tested directly; we therefore perform several checks to assess whether the occurrence of violence can be predicted beyond what our model in equation (1) captures. Specifically, we first regress the occurrence of violence on day t on the control variables in equation (1), that is, \(g({\mathbb{E}}({T}_{dt}| {\gamma }_{d},{\lambda }_{t},{T}_{d,t+\tau }))={\gamma }_{d}+{\lambda }_{t}+{\sum }_{\tau \in [-30,180]\backslash \{0\}}{\beta }_{\tau }{T}_{d,t+\tau }\), and obtain the residuals \({T}_{dt}-\hat{{\mathbb{E}}}({T}_{dt})\). Supplementary Table 1 assesses whether these residuals can be predicted using recent lags and trends in the outcome variable (30-day displacement) and the number of subscribers observed to be in a district. We find that, using either a linear model or a machine learning approach (a random forest with tenfold cross-validation), these characteristics do not accurately predict residual violence (\({R}^{2}\le 0.00028\) (95% CI using non-parametric bootstrap, percentile interval, (0.00026, 0.0010))). Finally, as an additional robustness check, we find that adding more restrictive region × month time-varying fixed effects to equation (1) does not qualitatively change the main results (Supplementary Fig. 6).

In estimating these regressions, we exclude district-days in which the outcome variable is 0, 1 or missing. The rationale is that these zeros and ones are probably due to data sparsity. On one hand, if no subscribers were recorded as being in a different district, it could be that their locations were simply missing (for example, they did not use their phones, there was no cell service or they switched providers). On the other hand, it is unlikely that all subscribers would have left a district on any day; a recorded 1 could indicate cell tower outages in the origin district, for example. (News reports have described the Taliban restricting access to communications or destroying cell towers, and we do see a small reduction in the number of active cell towers in a district during periods of violence. However, we do not see significant decreases in call volumes at a district level, nor do we see a decrease in the probability of a district having an active tower, probably indicating that individuals are able to connect to a different cell tower within the same district. If it is the case that all individuals are only able to connect to a cell tower in a different district, our response variable would be a 1 and hence dropped from the regression. This limits overestimation of the displacement response.)
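As a minimal sketch of how the outcome variable in equation (1) and the exclusion rule above might be computed, assume that each subscriber's daily modal districts are stored in a dictionary keyed by an integer day index, with missing days simply absent. The function names and the data layout are illustrative assumptions, not the authors' implementation.

```python
def k_day_outmigration(locations, district, t, k):
    """Proportion of subscribers in `district` on day t - k who are observed
    in a different district on day t.

    locations: {subscriber: {day_index: district}}; missing days are absent.
    Returns None when nobody was observed in `district` on day t - k.
    """
    present = [s for s, loc in locations.items() if loc.get(t - k) == district]
    if not present:
        return None
    # Subscribers with a missing location on day t stay in the denominator
    # but never count towards the numerator.
    moved = sum(1 for s in present
                if locations[s].get(t) is not None and locations[s][t] != district)
    return moved / len(present)

def is_usable(y):
    """Keep only district-days whose outcome is strictly between 0 and 1."""
    return y is not None and 0.0 < y < 1.0
```

Dropping exact zeros and ones is also what a beta likelihood requires, since its support is the open interval between 0 and 1.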

Several other points are of note. First, this estimation of displacement as an increase in migration due to violence also partially addresses the concern that the place of usual residence might be incorrectly measured using CDR. If violence does not impact the measurement error (for example, if the likelihood of a subscriber being misallocated to the district of their workplace instead of their home does not change due to violence), then the misallocation will not bias the estimated displacement. Second, although the treatments (violent events) occur relatively infrequently, the statistical model we employ is robust to sparsity; if all of the events are recorded accurately, estimates will not be biased because of sparsity. Non-random missingness of recorded events could bias estimates and is discussed in ‘Limitations’.

Summary of identification strategy

To more plainly summarize our statistical approach to measuring the effect of violence on displacement, we regress out-migration in each district-day on indicators for occurrences of violence up to 180 days prior and up to 30 days in the future, while controlling for geographic and temporal factors. This approach is designed to capture out-migration in excess of the out-migration that normally occurs in that district (on all other days) and on that day (in all other districts). Thus, the model does not assume that people do not move when violence is not occurring; instead, it uses movement in non-violent times and places as a baseline, to better isolate the additional movement that co-occurs with violence.

We include 180 lag terms and 30 lead terms to measure excess out-migration (again relative to normal out-migration) that occurs in the 180 days after violence and in the 30 days leading up to violence, as well as excess out-migration on the day of violence. Since violent events may be spatially and temporally correlated, a single observation (district-day) in the regression could have multiple violence indicators that are turned on; the migration dependent variable for that observation would thus contribute to the estimation of violence effects on all of the affected leads and lags.
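The lead and lag indicators for a single district-day can be sketched as below. The sketch follows the verbal description in the text, in which a positive τ refers to violence τ days in the past and a negative τ to violence in the future; this sign convention, the integer day index and the function name are assumptions made for illustration.

```python
def lead_lag_indicators(violent_days, t, max_lead=30, max_lag=180):
    """Return {tau: 0 or 1} for tau = -max_lead, ..., max_lag.

    violent_days: set of integer day indices with recorded violence in the district.
    tau > 0 flags violence tau days BEFORE day t; tau < 0 flags violence
    |tau| days AFTER day t; tau = 0 flags violence on day t itself.
    """
    return {tau: int((t - tau) in violent_days)
            for tau in range(-max_lead, max_lag + 1)}
```

With the default window this yields 211 indicators per district-day (180 lags, 30 leads and the day itself), and several of them can equal one at once when events cluster in time.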

Using this regression framework allows us to estimate the ‘average displacement effect’ of violence, averaged over the 3,354 violent events in our dataset that occur in districts on days with recorded mobile phone activity. For example, a coefficient of 0.03 on the indicator for violence at a lag of ten days can be interpreted as ‘On average, violence occurring ten days prior increases the odds of migration out of a district by 3% (a multiplicative change of \({{\mathrm{e}}}^{0.03}\approx 1.03\)), holding all other variables constant.’ This approach helps limit the extent to which any one specific event, which might have unusual characteristics or correlates, can influence our final results. For instance, if one violent event happened to occur on a day in which a certain district would have seen unusual out-migration even in the absence of violence, that single event would have a limited impact on our final estimates. The main concern is if violent events were systematically correlated with other unobserved factors—above and beyond the flexible spatial and temporal fixed effects that we control for in the regression.
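To see what a change in odds means on the probability scale, the sketch below converts a coefficient into a shifted probability for a given baseline. The baseline of 2% used in the note that follows is a hypothetical number chosen for illustration, not an estimate from the paper.

```python
import math

def shifted_probability(p0, beta):
    """Probability obtained after multiplying the baseline odds p0 / (1 - p0)
    by exp(beta), as implied by a logit link."""
    odds = p0 / (1.0 - p0) * math.exp(beta)
    return odds / (1.0 + odds)
```

For a hypothetical baseline out-migration probability of 0.02 and a coefficient of 0.03, `shifted_probability(0.02, 0.03)` is roughly 0.0206. For small baseline probabilities the odds and the probability are nearly equal, so a 3% increase in odds is close to a 3% relative increase in probability.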

Impact of a violent day

To distil the impact of a single violent day, for each k ∈ [1, 120], we consider the coefficient for Tτ for τ = k. This coefficient captures the effect of violence occurring at a τ-day lag on movement measured at time t, compared with district locations k days ago. When τ = k, the outcome variable is measured with respect to those in the district on the day of the violence. In this way, extracting the relevant coefficient from each of the regressions estimated for different values of k gives us the impact of a single violent day on the subscribers in the district on that day. We demonstrate the robustness of these results to potential data issues, such as the presence of outliers, as well as modelling issues such as the inclusion of additional time-varying controls, in Supplementary Figs. 5 and 6.
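The selection of coefficients described above amounts to reading the 'diagonal' τ = k across the k-specific regressions. A minimal sketch, assuming the fitted coefficients are already available as a nested dictionary (a hypothetical layout, not the output format of any particular regression package):

```python
def single_day_impact(coefs_by_k):
    """coefs_by_k: {k: {tau: beta_hat}} with one inner dict per k-day regression.
    Returns {k: beta_hat at tau == k}, skipping any k whose diagonal term is absent."""
    return {k: coefs[k] for k, coefs in sorted(coefs_by_k.items()) if k in coefs}
```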

Heterogeneous effects

To allow for the possibility that the displacement response may differ for different types of violence or for different types of locations, we estimate heterogeneous effects models, the results of which are shown in Fig. 3. These results are estimated by creating separate treatment indicators for different types of events (for example, low-casualty versus high-casualty), which replace the treatment indicators in equation (1). For instance, letting Hd,t+τ denote the occurrence of high-casualty (>10 casualties) violence and Ld,t+τ denote the occurrence of low-casualty violence, we estimate:

$$\begin{array}{l}g({\mathbb{E}}({Y}_{dt,k}| {X}_{dt},{H}_{d,t+\tau },{L}_{d,t+\tau }))\\={\gamma }_{d}+{\lambda }_{t}+\mathop{\sum }\limits_{\tau =-30}^{180}{\beta }_{H,\tau }{H}_{d,t+\tau }+\mathop{\sum }\limits_{\tau =-30}^{180}{\beta }_{L,\tau }{L}_{d,t+\tau }\end{array}$$
(2)

When analysing the heterogeneity of response by location (for example, for provincial capitals), we estimate prior regressions on the relevant subsets of the data—that is, by only including observations pertaining to provincial capitals.

Controlling for multiple dimensions of heterogeneity

To account for multiple dimensions of heterogeneity varying jointly, we analyse 30-day displacement by first fitting equation (3) separately for each of the events, using ordinary least squares:

$${\mathrm{log}}\left(\frac{{Y}_{dt,30}}{1-{Y}_{dt,30}}\right)={\gamma }_{d}+{\lambda }_{t}+\mathop{\sum }\limits_{\tau =-30}^{180}{\beta }_{\tau }{T}_{d,t+\tau }+{\epsilon }_{dt}$$
(3)

Here Td,t+τ refers to a single event at a time (each treatment indicator records whether or not the specific event occurs in district d at time t, at a lag of τ days). Only events for which all βτ coefficients can be estimated are included: if the outcome variable is unavailable on any day from 30 days before the event to 180 days after it, the event is excluded from the analysis. This results in a total of 2,359 events being studied. For each included event, we take the mean of the estimated βτ coefficients over each of the windows τ = 1–15, 16–30, 31–45, 46–60, 61–75 and 76–90. We treat these means as outcome variables and model each of these derived outcomes Oi as

$$\begin{array}{ll}{O}_{i}&={\beta }_{0}+{\beta }_{1}{{\mathrm{provCap}}}_{i}+{\beta }_{2}{{\mathrm{log(population)}}}_{i}\\ &+{\beta }_{3}{{\mathrm{IS}}}_{i}+{\beta }_{4}{{\mathrm{casualties11}}}_{i}+{\beta }_{5}{{\mathrm{peace60}}}_{i}+{\epsilon }_{i}\end{array}$$
(4)

where i is the event, provCapi is a binary variable denoting whether the event occurs in a provincial capital, log(population)i is the log of the population of the district in which the event occurs (added as a control), ISi is a binary variable denoting whether the event involved IS, casualties11i is a binary variable denoting whether the event was associated with 11 or more casualties and peace60i is a binary variable denoting whether the event was preceded by 60 or more days of peace. Figure 4 shows the estimated coefficients for each of the outcomes.
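The derived outcomes Oi are means of per-event coefficients over six 15-day lag windows. The averaging step can be sketched as follows, assuming the coefficients for one event are held in a dictionary keyed by τ; the names `BINS` and `binned_means` are hypothetical.

```python
# Lag windows (inclusive) over which per-event coefficients are averaged.
BINS = [(1, 15), (16, 30), (31, 45), (46, 60), (61, 75), (76, 90)]

def binned_means(beta):
    """beta: {tau: estimated coefficient} for ONE event.
    Returns {(lo, hi): mean of beta[tau] for tau in lo..hi inclusive}."""
    return {(lo, hi): sum(beta[tau] for tau in range(lo, hi + 1)) / (hi - lo + 1)
            for lo, hi in BINS}
```

Each of the six means would then serve as the dependent variable in a separate cross-event regression of the form given in equation (4).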

Destinations of displaced people

To investigate where the individuals displaced by violence go, we first examine migrant flows during non-event days (Fig. 5) and event days (Supplementary Fig. 7). We consider all recorded moves in any 30-day period and split these into days on which violent events occurred at the start of the 30-day period (‘event days’) and those on which no events were recorded (‘non-event days’). We repeat the following analysis for each. First, we categorize recorded moves as originating in either capital districts or non-capital districts. We then split destination districts into mutually exclusive categories by first recording whether they are in the same or a different province from the origin district; these destinations are then partitioned into three different types of districts—the major urban cities (Kabul, Kandahar, Hirat, Mazari Sharif and Jalalabad), other capital districts and non-capital districts.
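The categorization of moves described above can be sketched as a small classification function. The use of district names as keys, the `province_of` lookup and the `capitals` set are assumptions made for illustration; the list of major cities is taken from the text.

```python
MAJOR_CITIES = {"Kabul", "Kandahar", "Hirat", "Mazari Sharif", "Jalalabad"}

def classify_move(origin, destination, province_of, capitals):
    """Return (origin_type, same_province, destination_type) for one recorded move.

    province_of: {district: province}
    capitals: set of provincial capital districts (includes the major cities).
    """
    origin_type = "capital" if origin in capitals else "non-capital"
    same_province = province_of[origin] == province_of[destination]
    # Destination types are mutually exclusive: major cities are checked first.
    if destination in MAJOR_CITIES:
        destination_type = "major city"
    elif destination in capitals:
        destination_type = "other capital"
    else:
        destination_type = "non-capital"
    return origin_type, same_province, destination_type
```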

To estimate the effect of violence on the destination of displacement, we use a setup similar to that of equation (1). Instead of the outcome variable being the fraction of the population that moved over the preceding k days, we use the fraction of movers (those in a different district at time t compared with k days ago) observed to be at specific types of destination districts, as described above. We use outcomes for k = 7, 30, 90 and fit separate regressions for provincial capitals, for non-capitals and for each outcome. As before, district-days in which the outcome variable is 0, 1 or missing are excluded from the analysis.

Implications of missing violence data

As discussed in ‘Limitations’, our analysis does not include violent events that are not associated with specific locations (that is, where we do not know the district in which the event occurred). This could introduce bias into our analysis if certain types of violence (with specific migration responses) are systematically more or less likely to have known locations. We therefore conduct additional analysis to determine whether the spatial precision with which an event is recorded is correlated with the magnitude of the displacement effect.

Specifically, using the same empirical approach described in ‘Heterogeneous effects’, we create separate treatment indicators for each of the three types of events that we use in our analysis, based on their available geographic precision: (1) events for which the exact location is known and coded, Ad,t+τ (N = 1,698); (2) events that occurred within a 25 km radius around a known point, Bd,t+τ (N = 789); and (3) events for which only the district is known, Cd,t+τ (N = 969). These replace the treatment indicators in equation (1):

$$\begin{array}{ll}&g({\mathbb{E}}({Y}_{dt,k}| {X}_{dt},{A}_{d,t+\tau },{B}_{d,t+\tau },{C}_{d,t+\tau }))\\ &={\gamma }_{d}+{\lambda }_{t}+\mathop{\sum }\limits_{\tau =-30}^{180}{\beta }_{A,\tau }{A}_{d,t+\tau }+\mathop{\sum }\limits_{\tau =-30}^{180}{\beta }_{B,\tau }{B}_{d,t+\tau }+\mathop{\sum }\limits_{\tau =-30}^{180}{\beta }_{C,\tau }{C}_{d,t+\tau }\end{array}$$
(5)

The results, shown in Supplementary Fig. 8, indicate that the displacement response is very similar for violent events with these three different levels of spatial precision. There are small differences in the point estimates, but the general pattern of the response is unchanged, and the CIs of all three violence types overlap. Of course, this analysis does not eliminate the possibility that there might be a qualitatively different displacement response to violent events that are not recorded in our dataset (or for which district information is unknown). Unfortunately, we cannot directly test that concern, since we cannot estimate the displacement effect of violence when the location of the violence is not known.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.