The way in which data on conflict violence is collected can not only lead to severe underestimation of the human toll of conflict, but also to misinterpretation of trends in conflict violence, says Megan Price.
Networks of citizen journalists and human rights activists in Syria are conducting the dangerous work of recording the human toll of the ongoing conflict. Their work is invaluable. But it is just a first step, and as data scientists we must recognize the need for appropriate statistical analyses to make the most of the information collected by these brave people.
Multiple organizations are collecting information about victims who have been killed in the conflict. The general methodology is similar across groups — most maintain a trusted network within Syria, which provides new information collected from primary sources such as victims' families, community and religious leaders, and hospital and morgue records, as well as confirmation of information gathered from outside sources, typically mainstream and social media. The information that these networks are able to collect is affected by any number of factors. To name a subset: certain regions are more or less accessible, different communities have different levels of trust in the different networks, and as the security situation improves or disintegrates, networks expand or shrink. This is not a criticism of the work these organizations do. Rather, it is a call to recognize these challenges and appropriately account for these limitations in the analysis and interpretation of the records that these organizations collect.
Collecting information by gathering what is observable, given the resources and security situation, constitutes what statisticians call a convenience sample. While far from convenient to collect, this terminology distinguishes such datasets from those collected using an underlying probabilistic mechanism — via methods such as randomized household surveys, retrospective mortality surveys, or censuses of refugee camps. From an analytical perspective, the difference between these two approaches is that data collected using an underlying probabilistic mechanism provide a way to calculate the proportion of the total population that was captured in the collected data (the sample) and the probability of inclusion for each element in the sample. In contrast, data from convenience samples contain an unknown proportion of the total population, and we do not know the probability of selection for those whose information was collected. For the case of casualties in Syria, we do not know what proportion of all victims are included in each list, nor, based only on the existing lists, how many victims are as yet undocumented and unidentified.
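The difference can be sketched in a toy simulation. Every number here is hypothetical, invented purely for illustration, and the two-region setup is a deliberate oversimplification:

```python
import random

random.seed(1)

# Hypothetical population of 10,000 victims in two regions, one of
# which is hard for documentation networks to reach.
population = ["accessible"] * 6_000 + ["blocked"] * 4_000

# Probability sample: every person has a known inclusion probability
# (500 / 10,000 = 0.05), so observed counts can be scaled up.
sample = random.sample(population, 500)
est_blocked = sample.count("blocked") / 0.05  # close to the true 4,000

# Convenience sample: only the accessible region is observed, and no
# valid scale-up exists because inclusion probabilities are unknown.
convenience = [p for p in population if p == "accessible"][:500]
naive_blocked = convenience.count("blocked") / 0.05  # 0: region invisible
```

The probability sample recovers the blocked region's toll to within sampling error; the convenience sample misses it entirely, and nothing in the sample itself reveals that anything is missing.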
Lists of named dead collected by different organizations are extremely valuable sources of information about individual victims, but are not appropriate, on their own, for drawing conclusions about patterns of violence. Lists of victims tell individual stories and hint at the scale of violence perpetrated in the conflict. However, patterns derived from lists conflate the true underlying dynamics of violence with our ability to observe and record that violence.
This distinction is crucial. It is natural to ask questions about patterns of violence: did violence increase or decrease in Hama as territorial control changed hands from the regime to the opposition and back again? Did violence decrease following the ceasefires brokered by the United Nations? It is tempting to answer these questions by looking at patterns in the observed data, but this means treating convenience samples as if they were probabilistic samples, and thereby assuming that the reported patterns of violence represent the underlying patterns of violence. This is a very strong assumption, and one that is rarely met. Worse, it risks not just underestimating the true rates of violence but drawing the wrong conclusion about the direction of trends in violence, and so getting the answers to the questions above wrong. Those questions are too important to get wrong simply because we failed to use the proper analytical techniques and interpretations.
For example, we know that more victims were reported to documentation groups in Hama in December 2012 than in January 2013. But our statistical estimates of the underlying total number of victims (which include both those observed and unobserved) indicate that violence peaked in Hama in January 2013, with far more victims in that month than in the month before. The observed, reported pattern of violence indicates a decrease from December 2012 to January 2013, whereas it is likely that the true underlying pattern of violence increased during that time period. This is not a shortcoming of the documentation groups — we frequently find that precisely when violence spikes it becomes impossible to document fully, and it is therefore during those periods that analytical methods are most needed to fill in the gaps in what is observable.
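This kind of reversal can be illustrated with a toy calculation. The death counts and reporting rates below are invented for illustration, not HRDAG estimates: when the fraction of deaths that gets reported falls faster than violence rises, observed counts move in the opposite direction from the truth.

```python
# Toy illustration, not real data: a spike in violence paired with a
# drop in reporting coverage makes observed counts fall while the
# true toll rises.
true_deaths = {"Dec 2012": 400, "Jan 2013": 800}    # hypothetical truth
report_rate = {"Dec 2012": 0.75, "Jan 2013": 0.30}  # hypothetical coverage

observed = {m: round(true_deaths[m] * report_rate[m]) for m in true_deaths}
# observed == {'Dec 2012': 300, 'Jan 2013': 240}: a reported *decrease*,
# even though the true toll doubled.
```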
Methods for the appropriate analysis and interpretation of convenience samples have been well established within the scientific community for over a century, and statisticians have been warning human rights practitioners about the problems with raw data for over 20 years. All fields face problems with convenience samples: in the era of big data, if data exist, people want to use them. That impulse is understandable; it is appealing to base a decision on data rather than intuition. Nonetheless, reading statistical patterns directly from raw data is unlikely to lead to the right answer.
There are a number of options available to data analysts in these situations. It is true that in many conflict settings it is unrealistic to collect a probabilistic sample, much less a census. But this does not mean we are stuck naively analysing and interpreting convenience sample data. For example, at the Human Rights Data Analysis Group we use a method called multiple systems estimation (also referred to as capture–recapture) to model the data collection process and estimate what we have not been able to observe and record. This estimation process yields results that are appropriate for interpreting patterns of violence. We have produced preliminary estimates for specific time periods in Syria and hope to have a wider set of estimates available in 2017. These estimates will enable us to answer the kinds of questions posed above: questions that can help drive policy decisions, evaluate intervention strategies, allocate resources and, ultimately, determine accountability for violence.
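In its simplest two-list form, the idea behind multiple systems estimation can be sketched in a few lines. The counts below are hypothetical, and real analyses (including HRDAG's) use more than two lists and far more careful modelling than this classic estimator:

```python
def chapman_estimate(n_a, n_b, m):
    """Chapman's bias-corrected version of the two-list
    Lincoln-Petersen capture-recapture estimator.

    n_a: records on list A
    n_b: records on list B
    m:   records matched across both lists (same victim documented twice)
    Returns the estimated total population, observed plus unobserved.
    """
    return (n_a + 1) * (n_b + 1) / (m + 1) - 1

# Hypothetical example: two documentation groups record 300 and 250
# victims respectively, and careful record matching finds 100 victims
# who appear on both lists.
total = chapman_estimate(300, 250, 100)  # about 747
documented = 300 + 250 - 100             # 450 unique named victims
undocumented = total - documented        # roughly 297 still unobserved
```

The two-list estimator assumes the lists capture victims independently, which casualty lists rarely do; this is one reason multiple systems estimation in practice combines three or more lists and models the dependence between them.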