Introduction

The social media revolution

Social media has completely revolutionised the way in which information is shared and consumed, and is now relevant to government agencies and companies (Kaplan and Haenlein, 2010). Social media has given its users the ability to share content and opinions without depending on traditional, centralised news media outlets, potentially achieving a more democratic distribution of opinions and offering users the ability to reach a large proportion of the population (Kwak et al., 2010).

Data collected from social media is a valuable input for analysing the flow of information, opinions and sentiments, by detecting who shares what and how frequently. Millions (or perhaps even billions) of posts or tweets have been used to detect social media activism (Xu et al., 2014), to assist emergency responders (Avvenuti et al., 2016, 2018), to analyse the spread of a disease (Lampos and Cristianini, 2012), to detect the role of different users in the network (Martinez Teutle, 2010) and their behaviour (Cresci et al., 2020; Mazza et al., 2019), to quantify media coverage (Prieto Curiel et al., 2019), to provide indications for tourists (Barchiesi et al., 2015a, b; Cresci, 2014; Muntean et al., 2015), to detect road traffic (D’Andrea et al., 2015), to analyse exposure to cross-ideological content (Himelboim et al., 2013), access to political information (Himelboim et al., 2013) and political participation (Ausserhofer and Maireder, 2013), to study the perception of social phenomena such as migration flows (Coletto, 2017), and even to detect the popularity of different types of food (Amato, 2017) and to construct a real-time measure of happiness or hedonometer (Dodds et al., 2011). Although most of what is shared on social media is neither news nor related to public issues, it has nonetheless become, for some, one of the main sources of political information and news (Gil de Zúñiga et al., 2012).

However, is the information from social media a fairly good representation of reality? Does the fact that “everyone can post something” on Twitter or Facebook translate into a more accurate description of reality than that of the traditional media? How can the bias that exists in social media be measured?

Arguably, there is no other type of social event that is both regularly and rigorously measured and, at the same time, prompts traditional media to report and users to post comments on social media as crime does. Here, we use crime as the object to quantify social media and to link it with traditional media and with the observed reality. Crime itself is a complex phenomenon with many unexpected social behaviours, which are difficult to understand, control and, sometimes, even to quantify (Helbing, 2015). For example, enforcing longer prison sentences or harsher punishments might not decrease crime rates (Becker, 1968); allocating more police might not reduce crime via deterrence (Kleck and Barnes, 2014); and a city with higher levels of crime might be perceived as being more secure (Prieto Curiel and Bishop, 2016). Crime and fear of crime have many non-linear elements (D’Orsogna and Perc, 2015; Gordon, 2010). Here, we collect millions of tweets and detect whether crime is expressed in social media. We analyse which types of account publish crime-related posts, quantify whether more crime or more fear of crime at a city and country level is correlated with more expressions of crime in social media, and compare the amount of crime that is expressed in social media with that in mass media.

Our results show that crime rarely appears on social media and that it is usually mentioned by only a few users. Tweets about crime come mainly from news agencies and reporters, and exhibit the same reporting biases as the mass media towards violence, sexual crime and murders, almost mimicking what was observed in mass media, such as printed newspapers.

We then compare the quantity of social media expressions of crime with observed trends of crime and fear of crime. Although there is a significant time discrepancy between the crime data and the collected social media posts, assuming that high-crime areas remain with a similar insecurity problem for some time, and that regions perceived as being insecure also remain with the same problem, results show that there is little correlation between crime and social media posts, but social media posts are related to the fear of crime of a region.

Traditional media misrepresents criminal reality

To be newsworthy, social events must capture the attention of the viewer or reader, and so they have to be rare, timely or unexpected, or alternatively have some special significance (Chermak and Gruenewald, 2006). However, the majority of crimes do not have these attributes, since many are merely attempted crimes or do not have serious consequences (Skogan, 1987). Traditional media gives a distorted version of the crimes within a city, with a significant bias towards violence (Hollis and Downey, 2017).

As a result, fewer than 1 out of 400 crimes is featured in the traditional news (Chadee and Ditton, 2005), and these are usually crimes with violence or with a sexual component, outweighing non-violent crimes, even though non-violent crimes are much more frequent (Ditton and Duffy, 1983). For instance, considering the ten most popular printed daily newspapers published in the UK for a period of 4 weeks in 1989, it was found that nearly 65% of the space that was devoted to crime was related to personal violence, whereas official statistics reported that only 6% of crimes involved violence (Dickinson, 1993). Similarly, taking 25 editions of newspapers in cities of the USA, it was found that nearly 30% of crime stories were murders, whereas in fact only 0.02% of the crimes are murders (Liska and Baccaglini, 1990).

Is crime also misrepresented in social media?

There are significant differences between what is published in traditional media, and what is posted on social media. For instance, traditional media typically covers major disasters in more depth than social media (Olteanu et al., 2015). In terms of crime, social media could provide a more accurate description of the crime suffered in a country or a city than traditional media does. Victims, indirect victims, and witnesses might be inclined to share their thoughts after experiencing a crime (Cresci et al., 2018), regardless of whether that crime was minor or not, so that social media might be able to provide a more accurate version of the criminal reality of a city. Leaving aside the potential readability issues (Temnikova et al., 2015) and fake news (Del Vicario, 2016; Mendoza et al., 2010), social media could potentially provide a powerful tool for detecting crime trends and patterns.

Social media could also provide a useful source of information for detecting and measuring fear of crime. Although a clear relationship between crime and its fear is perhaps expected, there is often a mismatch between fear of crime and the actual crime suffered in a city (Prieto Curiel and Bishop, 2016; Skogan, 1987), so that people often fear crime even if they are immune to suffering any (Prieto Curiel and Bishop, 2017). Traditional ways of measuring fear of crime frequently depend on costly victimisation surveys (Carro et al., 2010; Ferraro and Grange, 1987; Hale, 1996), which have a considerable time delay between the date when the study is conducted and when the data is available for analysis. However, with social media, we could obtain an almost immediate reflection of the fear of crime in a city. Users of social media might express their concerns and fears of crime more frequently in a more dangerous city, enabling us to understand how we arrive at our perception of security (Kounadi et al., 2015).

Therefore, social media could be a powerful tool for measuring crime trends and patterns but only if it is, in fact, related to crime, or social media could help us understand the fear of crime and perceptions of security but only if it is, in fact, related to actual fear of crime. Also, social media could be used at a city level, both in terms of crime and fear of crime, if there are enough tweets related to crime and if they are meaningfully associated with the local problems of insecurity.

Other experiences using big data in crime analysis

There are successful attempts to unleash the “wisdom of crowds” expressed in social media in the context of crime analysis (Bendler et al., 2014), mainly with the idea of obtaining information about crime and fear of crime from new sources of information (Solymosi and Bowers, 2018). Some platforms have been developed exclusively to crowdsource information about crime, for instance, Postacrime.com, CityWatch (Cvijikj et al., 2015) or a fear of crime application (Solymosi et al., 2015).

However, most of the data from social media is produced as a digital byproduct of a platform, such as Facebook or Twitter (Hilbert, 2016), and so applications of social media to crime analysis face some serious challenges. Firstly, not everything posted on social media is true (Mendoza et al., 2010); secondly, posts might be difficult to understand due to the use of abbreviations, hashtags, typos, or the lack of connecting words and more (Temnikova et al., 2015). Also, although social media posts offer a rapid distribution of information, it was found that there is frequently a delay, of possibly several months, in the case of crime-related posts in social media (Kounadi et al., 2015), so that, in fact, the most up-to-date posts are not necessarily a reflection of the current crime and security situation of a city.

What information has been extracted from big data in the context of crime

Firstly, the location of published content in social media. Without analysing the text contained, the density of tweets per unit of area has been used to identify the correlation between Twitter users and crime rates (Bendler et al., 2014), and for detecting risky areas in a city, showing that indeed hotspots move when the density of tweets is considered (Malleson and Andresen, 2015). In these studies, social media is used as a proxy for a mobile population density.

Secondly, by analysing the contents of the published messages. For instance, by looking at topics in tweets, the prediction of the locations of certain types of crimes is claimed to be improved (Gerber, 2014), with even further improvements if the weather conditions are included in the model (Chen et al., 2015). By detecting whether a tweet contains indicators of disorder or decay (with words such as “dumping”), Twitter data increases the variance explained in crime estimation models (Williams et al., 2017). Reports to fix street disorders, such as graffiti, were successfully used as signals correlated with fear of crime (Solymosi et al., 2018). Also, by detecting tweets which contain at least one violence-related word, the attitudes of the audience around stadiums were analysed (Ristea et al., 2017). In a similar manner, the tweets from a news agency were analysed to “predict” hit-and-run crimes in a city in the USA (Wang et al., 2012).

Thirdly, the structure of the network (who follows whom, who retweets whom, or who was nearby) has revealed structures of organised crime, the roles that different actors might play, and which actors play a central role in a corruption network (Requião da Cunha, 2018).

Finally, social media has also been used to analyse the reaction of the audience to a specific incident. For instance, in the case of the Paris attacks in 2015 and after several major earthquakes, it was detected that event-related tweets triggered more retweets than those expressing support, possibly due to a higher information content (Avvenuti et al., 2017; Cvetojevic and Hochmair, 2018), and that tweets with images received more retweets.

Other relevant uses of big data and social media in crime analysis and prevention have been implemented, for instance, using data from Foursquare (Kadar and Pletikosa, 2018), and for fraud detection (Bello-Orgaz et al., 2016) or to detect the marketing and distribution of illicit drugs (Watters and Phair, 2012) or for crowdsourcing information related to specific incidents, such as the Boston Marathon bombing (Markowsky, 2013). Also, by looking at tweets related to 98 homicides in London, it was detected that nearly half of the users are within a few kilometres of the location of the incident, but also <10% of those tweets are published on the actual day of the murder, and so there is a considerable time delay, often, months, between a specific crime and tweets related to it (Kounadi et al., 2015).

Although there are many more applications of big data and social media in crime science (Chan and Bennett Moses, 2016), and more of them will appear as new platforms are developed, there are certain challenges about establishing the usefulness of social media in a crime setting.

Social media is not a fair representation of the whole population

Twitter is incredibly useful for spreading information or personal views but it is not perfectly representative as it does not provide a random sample of the population since not everyone uses the service (for example, <10% of the population in the case of Mexico) plus, even of those that do use it, there is likely to be an age bias towards the 25- to 45-year-old age group. What is more, among those who do use Twitter, just gleaning the number of tweets on a particular topic does not provide a random sample of opinions from its users since some accounts post dozens of tweets per day, while others post with a much lower frequency. Moreover, users with the highest fear of crime, or involved users who are promoting a particular view, are more likely to post tweets about crime.

Also, it is important to note that some tweets might reach a larger audience than others since accounts run by organisations have greater numbers of followers (Mendoza et al., 2010). Hence it is likely that merely measuring the number of tweets that include an expression of crime or fear, alone and weighted equally (an approach which has been useful elsewhere), is not representative of actual crime taking place. Bearing in mind that some information promulgated by tweets might not even be true (Del Vicario, 2016), this again reinforces the fact that the collection of tweets is not necessarily a reflection of reality and does not express the “true picture”.

Data and methods

The objective here is to detect expressions of crime and fear of crime in social media, measure them and compare them with the actual crime suffered. Three sources of information are used for such comparisons: Twitter data, crime data and fear of crime data.

Data from Twitter

Data were collected from Twitter users in 18 Spanish-speaking countries in Latin America (Footnote 1). This target group was selected for several reasons. Firstly, roughly 400 million people share the Spanish language, among whom there is a high proportion of internet users (Footnote 2). Secondly, the data spans a number of countries, allowing sufficient breadth so that we can compare one country with another, where there might be different social norms and crime rates, and additionally ensuring that the data will not be dominated by local trends. And finally, crime is one of the most relevant problems suffered by the whole region, with 42 of the world’s 50 most violent cities located in Latin America (Footnote 3).

Twitter posts

The data used in this study were collected using Twitter’s Streaming API (Footnote 4) between May 22 and July 30, 2017 (70 days). During this time, all geo-located tweets shared from within the 18 Spanish-speaking countries in Latin America were initially retained, while retweets were not considered. This resulted in a dataset of 32,513,684 distinct tweets to be classified later: 27% of them were shared from Mexico, 23% from Argentina, 12% from Colombia, and the rest from smaller countries.
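The retention rules described above (geo-located, not a retweet, posted from a target country, counted once) can be sketched as a simple filter. This is only an illustration and not the authors' collection code: the field names `retweeted_status`, `place` and `country_code` are assumed to follow the Twitter v1.1 tweet object, the set of country codes is a hypothetical three-country subset of the 18 countries, and the sample records are invented.

```python
# Hypothetical subset of ISO country codes; the study covers 18 countries.
TARGET_COUNTRIES = {"MX", "AR", "CO"}


def keep_tweet(tweet: dict) -> bool:
    """Return True for a geo-located, non-retweet post from a target country."""
    if "retweeted_status" in tweet:   # retweets are not considered
        return False
    place = tweet.get("place")        # assumed to be None when not geo-located
    if not place:
        return False
    return place.get("country_code") in TARGET_COUNTRIES


def distinct_tweets(stream):
    """Yield each retained tweet once, keyed on its id."""
    seen = set()
    for tweet in stream:
        if keep_tweet(tweet) and tweet["id"] not in seen:
            seen.add(tweet["id"])
            yield tweet


# Invented sample records, for illustration only.
sample = [
    {"id": 1, "text": "hola", "place": {"country_code": "MX"}},
    {"id": 2, "text": "RT hola", "place": {"country_code": "MX"},
     "retweeted_status": {"id": 1}},
    {"id": 3, "text": "sin lugar", "place": None},
    {"id": 1, "text": "hola", "place": {"country_code": "MX"}},
    {"id": 4, "text": "hello", "place": {"country_code": "US"}},
]
kept = list(distinct_tweets(sample))
```

In this toy sample only the tweet with id 1 is retained: the others are a retweet, a post without a location, a duplicate and a post from outside the target countries.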

Whenever possible, the city from which the tweet was published was also recorded so that 64 of the larger cities could be considered separately. As a result, we obtained 2,678,783 tweets (8.2% of the total) with a city level geographic resolution.

Identifying and classifying crime-related tweets

To identify that a tweet was crime-related, an extensive list of 392 words and hashtags consisting of terms related to crime, organised crime, or fear of crime was constructed, in an incremental way, by reading tweets and news related to crime. The list includes gender and singular-plural variations, such as “murder”, or “stolen” and other words, which have a crime connotation. In the list (see the Supplementary Information), 274 are in Spanish and the rest are in English (Prieto Curiel et al., 2018a).

Each word in the list was then assigned to different crime categories, defined as “violence-related”, “property-crime-related”, “organised-crime-related”, “sexual-crime-related”, “murder-related” and “gun-related”, where some words can be assigned to more than one category.

If a tweet contains any of the 392 words related to crime, the post is considered to be crime-related. In addition, if the tweet contains a word from a specific category, then the tweet is also considered to be part of that category. Notice that categories might have a considerable overlap or might even be fully contained, for instance, murder-related tweets are fully contained in the category of violence-related tweets.
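The word-matching rule just described can be sketched as follows. The miniature word list and its category assignments are hypothetical stand-ins for the 392-term list in the Supplementary Information, and the tokenisation (lower-casing and splitting on word characters) is an assumption rather than the procedure used in the study.

```python
import re

# Hypothetical miniature word list: term -> crime categories it belongs to.
CRIME_WORDS = {
    "asesinato": {"murder", "violence"},
    "murder": {"murder", "violence"},
    "kill": {"murder", "violence"},
    "robo": {"property"},
    "stolen": {"property"},
    "delito": set(),  # crime-related, but assigned to no specific category
}

TOKEN = re.compile(r"#?\w+")


def classify(text):
    """Return (is_crime_related, categories) for the text of one tweet."""
    # Lower-case, tokenise, and treat a hashtag like the bare word.
    tokens = {t.lstrip("#") for t in TOKEN.findall(text.lower())}
    matched = tokens & set(CRIME_WORDS)
    categories = set()
    for word in matched:
        categories |= CRIME_WORDS[word]
    return bool(matched), categories


related, cats = classify("Hubo un #asesinato y un robo en la ciudad")
figurative, figurative_cats = classify("I would kill for a holiday right now")
```

With this toy list, the first tweet is flagged as crime-related and falls into the murder, violence and property categories. The second tweet is also flagged, because a listed word is matched regardless of whether it is used literally.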

A similar technique, detecting the use of words such as “food” or “wedding” allowed the construction of a real-time measure of happiness (Dodds et al., 2011), which showed weekly and daily cycles of happiness. In terms of crime, detecting whether a tweet includes words such as “violence”, was used to classify tweets as “crime-tweets” (Ristea et al., 2017).

Not all “crime-related” tweets are related to crime

The list of words and the methodology was designed thinking of ways in which it could be used in real-time, for instance, by the Police Department. There are two types of error that could emerge by classifying tweets based on whether or not they contain a word from the list: identifying a tweet to be related to crime when it is not, and not classifying a tweet to be related to crime when it was.

The second type of error, which consists of tweets that are related to crime but do not include words on the list, is relevant (and perhaps worse than the other type of mismatch), as no emergency response would be dispatched. This type of error represents all expressions of fear of crime, calls for emergency assistance and crime reports that the Police would miss with our method. However, by constructing an extensive list of crime-related words, this type of error is considered to be kept to a minimum. Also, in the case of a real-time application, adding some words to the list (for example, the name of a newly organised crime group) would keep this type of error at a minimum level.

The first type of error (classifying a tweet as crime-related when it has no relationship with crime) has little impact in terms of a real-time application, as the Police Department would simply read the tweet manually, see that it is related to football or anything else, and neither dispatch units nor follow up on it (similar to a fake emergency call to 911). We use this type of error to measure the precision of our method.

Some tweets might contain a phrase such as “I would kill for a holiday right now” or similar expressions that are not actually related to crime or to fear of crime. To measure the frequency of this type of mismatch, 3000 crime-related tweets from different countries were individually read and manually analysed to determine if the post was actually related to crime or not. The manual classification of the 3000 tweets also required website visits if the post included a link or a visual inspection of the media included (such as photos or videos). The 3000 annotated tweets were randomly picked across different dates, places, and for different categories of crime and the false-positive rate was analysed.

Roughly 66% of the 3000 crime-related tweets were manually found to be related to crime, or an expression of fear of crime, or a demand for justice or security. While this could be tested for different cities or countries or, for instance, times of the day or days of the week, it is a very time-consuming test, which would help us identify some words that could be dropped from the list, since they are more frequently used in a non-criminal context. Although a learning technique or a more sophisticated text-analysis technique (including n-grams, for instance) could be used, for some tweets it is still not easy to determine whether or not the post was related to crime, even if it was manually read. Similar methods of validation have used a smaller number of annotations to check and quantify their results (Pak and Paroubek, 2010, 2011; Ritter et al., 2011) and here, instead of refining the technique to determine whether a person is expressing crime or fear of crime, we assume that this potential mismatch is observed for all tweets. With a 95% confidence interval, the error is 1.8%, meaning that roughly between 64 and 68% of all tweets identified automatically as crime-related tweets are actually associated with crime.
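The reported margin can be checked with the usual normal approximation for a sample proportion. This is a sketch under an assumption: the text does not say which formula was used, but the 1.8% figure is consistent with the conservative choice p = 0.5, while plugging in the observed 66% gives a slightly smaller value.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


n = 3000          # manually annotated tweets
precision = 0.66  # share found to be actually related to crime

conservative = margin_of_error(n)          # worst case p = 0.5, about 0.018
plug_in = margin_of_error(n, p=precision)  # observed proportion, about 0.017

low, high = precision - conservative, precision + conservative
```

Under the conservative choice, the interval runs from about 0.64 to about 0.68, matching the range quoted above.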

It is assumed that two-thirds of the crime-related tweets are actually associated with crime (and the other third is just an expression of something else), and this is assumed to be uniform across all tweets considered to be crime-related (Pak and Paroubek, 2010, 2011; Ritter et al., 2011).

Crime data at a national level

Unfortunately, the majority of crimes are not reported to the police and any cross-national comparison strongly depends on the definitions used for different types of crime. Therefore, murders (intentional homicides) are used for comparing between different countries since they are, perhaps, the most reliable way to compare the levels of crime. The most up-to-date data per country for the number of murders, as well as the number per 100,000 inhabitants, is published by the United Nations Office on Drugs and Crime (Footnote 5). The number of murders for the year 2015 is available for most of the countries analysed although, in some cases, it reflects data from previous years.

Fear of crime is also analysed, using the survey of the Latin American Public Opinion Project (LAPOP) (2017), which includes the question “Thinking of your daily life, how much fear do you have of being a direct victim of homicide? Do you feel a lot of fear, some fear, little fear or no fear at all?”. The answers are combined to produce a fear of crime index (Tseloni, 2007). More details about the LAPOP survey are available in the Supplementary Information.

Crime data at city level

Comparing the actual crimes suffered in each metropolitan area to events expressed in social media at a city level is complicated with respect to both variables. Unfortunately, a comparison between cities from different countries is not possible due to the varying definitions of crime, ways of measuring crime, and significant, but varying, issues with unreported crime.

At a national level, there are some resources, such as victimisation surveys, but unfortunately, not all countries have them and it is often difficult to compare between two different surveys with different types of questions. At a regional level, the LAPOP survey is very useful, but it is not representative at a metropolitan level and therefore, it cannot be used either. For the social media posts, a more local dimension, such as cities, is complicated since only a small number of users publish their location. Also, it was found that the accounts that post their location might be local newspapers, which tweet about crime much more frequently than a general user would, and this creates a strong bias in some cities.

Despite the aforementioned issues, a focus on Mexico is useful since good data is available to compare between the 23 metropolitan areas with at least 750,000 inhabitants. A victimisation survey from Mexico, the Encuesta Nacional de Victimización y Percepción sobre Seguridad Pública (ENVIPE) (Encuesta, 2016), provides estimates for the crime rates suffered by the population of each city, divided into different types of crime, and provides a metric for the population who have fear of crime in their city. It therefore allows us to quantify the level of crime and fear of crime in each of the 23 cities using the same definitions. The victimisation survey in Mexico is for 2016 so that, again, the time intervals of the two data sources do not match. More details about ENVIPE are available in the Supplementary Information.

Temporal misalignment of the data

Ideally, spatial and temporal alignments of the crime data and the social media data should be observed, but this is not the case.

One of the most difficult parts of the analysis is that there is a time delay between the moment in which crimes were suffered, the moment in which tweets are posted and the time at which the crime data (such as reports and surveys) is available. However, this will always be the case. For example, tweets were collected in London to detect how people reacted to a murder on social media (Kounadi et al., 2015) and it was detected that less than 10% of the tweets related to a homicide are published on the actual day of the murder, so that, on the social media side, there is a considerable time delay, often of months.

Also, on the crime data side, victimisation surveys, which allow a full analysis of the suffered crime, have a delay of several months between the moment a person suffered a crime and the time at which the data is available to conduct research. Thus, there is a delay in the tweets but also a different delay with respect to the crime data, and this is a natural part of social media studies. The time delay is an unavoidable part of any analysis that compares and measures the attention that society puts on different topics and events.

However, although there is a significant and unavoidable temporal misalignment between the crime data and the Twitter data, overall trends for the level of crime in a country do not change drastically from one year to the next (Prieto Curiel and Bishop, 2016). Something similar happens with the perception of insecurity: a region perceived as insecure remains with the same problem even if the crime rates drop (Prieto Curiel and Bishop, 2017). Therefore, even with the temporal misalignment, crime and fear of crime data might still be considered as a proxy to determine a general level of crime and insecurity observed in a region.

Results

Crime portrayed on social media

Collecting 32 million tweets and then classifying them as being related to crime gives a quantification of crime-related topics in social media. This quantity is considered per city and per country, and it is analysed by the proportion of tweets for comparability purposes.

The amount of crime-related tweets and other categories (violence-related, murder-related, property crime-related, etc) is compared to crime data and fear of crime data from each region (city and country). Although we do not link a tweet with a specific crime, here the analysis is conducted with respect to the frequency of crime-related tweets, so that we analyse why a city has more posts related to crime than others.

Roughly 1.5% of the tweets are crime-related

Based on the approach outlined, from the 32 million tweets collected, 501,057 were deemed to be crime-related. Thus, 15.41 tweets out of 1000 tweets posted in the major Spanish-speaking countries in Latin America are considered to be related to crime. The aggregated number of crime-related tweets per city and per country is available in a public repository (Prieto Curiel et al., 2018b).

During the 70 days over which data was collected, there are 317.5 tweets posted every minute from the 18 countries considered and from these tweets, five are crime-related.
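The per-1,000 rate and the per-minute figure for crime-related tweets follow directly from the counts reported above. The short check below uses only those reported counts and the 70-day collection window; the variable names are illustrative.

```python
total_tweets = 32_513_684  # distinct tweets collected
crime_tweets = 501_057     # tweets classified as crime-related
days = 70                  # length of the collection window

# Crime-related tweets per 1,000 tweets posted.
rate_per_1000 = 1000 * crime_tweets / total_tweets  # about 15.41

# Crime-related tweets per minute over the whole window.
minutes = days * 24 * 60                            # 100,800 minutes
crime_per_minute = crime_tweets / minutes           # about 4.97
```

Both values agree with the figures quoted in the text: roughly 15.41 crime-related tweets per 1,000 and about five crime-related tweets per minute.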

Violence is the most common type of crime-related tweet

Turning now to the issue of which crime category of the tweets is more prevalent, the most frequent category is the one designated as violence-related tweets with 6.51 out of 1000 tweets (Fig. 1).

Fig. 1: Crime-related tweets per 1,000 in Latin America.

The most frequent category of crime-related tweet is violence-related, forming more than 40% of the crime-related tweets.

We found that violence-related tweets were 3.7 times more frequent than property-crime tweets. Also, murder-related tweets are 2.3 times more frequent than property-crime tweets.

Every minute, there are five crime-related tweets posted within the 18 countries in Latin America and every minute, there are two violence-related tweets.

Different countries have different expressions of crime in social media (Fig. 2). In Venezuela, nearly 40 out of 1000 tweets are related to crime, whereas Nicaragua, Panama, Bolivia and Costa Rica have less than 10 out of 1000.

Fig. 2: Crime-related tweets per 1,000 in 18 countries in Latin America.

For each country, different categories of crime-related tweets are considered, including violence-related tweets, murder-related tweets and others. In Venezuela, 38.1 crime-related tweets per 1,000 are published, but the number is much smaller for other countries in the region.

Social media is just as biased as traditional media with respect to crime

In the tweets collected, 28.3% of the crime-related tweets are related to murder although murder accounts for only 0.072% of the crimes suffered in Mexico. Similarly, taking 25 editions of newspapers in the USA, it was found that nearly 30% of the crime stories were about murder, but murder represents only 0.02% of crimes (Liska and Baccaglini, 1990). Violent deaths, such as homicide, suicide or deaths by terrorism, are overrepresented in the news (Footnote 6) and they are also overrepresented in social media.
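One way to put the two biases side by side is the ratio between the share of coverage devoted to murder and the share of murder among crimes. This is a rough illustration only: the helper name is hypothetical, and the underlying populations differ (tweets from 18 countries set against crimes in Mexico, and newspapers against crimes in the USA), so only the orders of magnitude are meaningful.

```python
def overrepresentation(share_in_media, share_of_crimes):
    """How many times larger a crime's share of coverage is than its share of crimes."""
    return share_in_media / share_of_crimes


# Shares in percent, as quoted in the text.
tweets_factor = overrepresentation(28.3, 0.072)     # about 393
newspapers_factor = overrepresentation(30.0, 0.02)  # about 1500
```

In both cases murder receives hundreds of times more attention than its frequency among crimes would suggest.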

Only a few accounts publish crime-related tweets

Most of the accounts do not post anything related to crime. In Latin America, 90% of the active users (Footnote 7) detected during 70 days did not publish anything related to crime or violence, so that only 10% of the accounts post tweets related to crime, varying from 2% in Cuba, 7% in Mexico and up to 20% in Venezuela.

There are just a few accounts that publish the majority of the crime-related tweets. A similar situation was encountered before with other types of social media, in which a few super-contributors publish the majority of the posts (Solymosi et al., 2018). The top 1% of the users post 61% of the crime-related tweets and the corresponding top 1% of the users post 62% of the violence-related tweets, whereas the top 1% of the users by overall activity (meaning those who publish most frequently) post only 35% of all tweets. Thus, crime-related tweets have a much higher concentration than regular tweets. A way in which the over-influence (or the concentration) can be formally measured is with the Gini coefficient of the number of tweets, the Gini coefficient of the crime-related tweets and the Gini coefficient of the violence-related tweets, which goes from 0.838 for all tweets in Latin America to 0.965 for crime-related tweets (Fig. 3). This means that indeed, crime is, in general, only mentioned by some accounts with a very low frequency and by very few accounts (news agencies, institutions, and involved users) with a very high frequency.

Fig. 3: Number of users, number of users who posted at least one crime-related tweet and concentration of tweets among users (divided by all tweets, crime-related and violence-related tweets) measured using the Gini coefficient of the number of tweets.

The Gini coefficient is a metric frequently used to measure income inequality and it is obtained by comparing the observed distribution of wealth against the hypothetical case in which all individuals have the same wealth. Here, the Gini coefficient is computed by comparing the number of tweets per account against the case in which all accounts publish the same number of tweets. A lower Gini coefficient for crime-related tweets in a country means that more of its users engage with crime issues.
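A minimal sketch of both concentration measures, the Gini coefficient and the share posted by the top 1% of accounts, is given below. It uses the standard rank-based formula for the Gini coefficient of a list of per-account counts; whether the study used this exact estimator, and which accounts it included for each category, is not stated, so this is an assumption. The toy counts are invented.

```python
def gini(counts):
    """Gini coefficient of non-negative per-account counts (0 means all equal)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum, with ranks starting at 1 for the smallest value.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_share(counts, fraction=0.01):
    """Share of all posts published by the most active `fraction` of accounts."""
    xs = sorted(counts, reverse=True)
    k = max(1, round(len(xs) * fraction))
    return sum(xs[:k]) / sum(xs)


# Invented example: 99 accounts with one tweet each and one with 901 tweets.
toy_counts = [1] * 99 + [901]
toy_gini = gini(toy_counts)         # 0.891
toy_top1 = top_share(toy_counts)    # 0.901
equal_gini = gini([5, 5, 5, 5])     # 0.0
```

In the toy example a single account publishes about 90% of the posts, which gives a Gini coefficient close to 0.9, while identical counts give a coefficient of zero.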

Mass media has a strong influence on social media

Taking a sample of 100 crime-related tweets and manually labelling the types of accounts which post them, it is possible to classify and detect which users publish most of the crime-related tweets. Results show that 33% of the crime-related tweets are posted either by the Twitter account of a newspaper or by a journalist.

By looking at the accounts that publish crime-related tweets, accounts that engage with, promote or complain about security-related issues in Latin America were also detected and labelled as “involved users”. They are not the individuals who suffer first or second-hand victimisation (so, not the victims or direct witnesses), but individuals who protest against the levels of insecurity or a specific crime. Although there is no clear distinction between involved users and ordinary accounts, when the Twitter profiles were manually sorted by their type of engagement, an account was considered to be an involved user if it follows government and institutional accounts, follows newspapers and other sources of information and, more importantly, engages frequently with different crimes or with different issues (including security, but also gender equality, corruption, air pollution and climate change, among many others). For most of the users that were manually classified, their “involvement” was clear and constantly observed in their tweets and in their engagement with other social media users. Involved users publish 22% of the crime-related tweets.

In most cases, when a specific crime is mentioned by an “involved user”, the tweet also contains a link to the information source, and the majority of these links point to a traditional media website. Similar to what has been encountered in the political debate (Himelboim et al., 2013), traditional media has a strong impact on social media in terms of what is posted about crime, either directly (33% of the crime-related tweets) or through involved users (22% of the crime-related tweets).

In summary, 33% of the crime-related tweets are published by journalists and accounts that belong to mass media outlets, 22% by accounts labelled as “involved users”, and a further 7% by government accounts (for example, an account that belongs to the police or to the mayor of a city). Therefore, only 38% of the crime-related tweets are posted by “regular” users, that is, accounts that are neither media, involved users nor government.

Social media against reality in terms of crime

Comparing the number of crime-related tweets with the observed levels of crime and fear of crime at country level lets us detect whether there is any correlation between crime and what is portrayed on social media. Although there is a temporal misalignment of the data, there are some stable patterns in terms of crime and fear of crime. For instance, according to the World Bank, the intentional homicide rate (per 100,000 people) between 2007 and 2016 ranged between 2.5 and 3.7 in Chile, between 5.8 and 8.5 in Uruguay, between 21.9 and 29.5 in Brazil and between 46.5 and 80.1 in Honduras (Footnote 8), with no overlap between these four countries over a period of 10 years.

At a national level more murders means more crime-related tweets

Across the 18 countries, the number of crime-related, violence-related and murder-related tweets per 1000 tweets posted is positively correlated with the number of murders, so that countries with a higher number of murders are also expected to have more crime-related tweets (expressed as a linear model in Fig. 4). Results show that, in general, countries with a higher number of murders, a higher murder rate and a higher fear of crime tend to have more crime-related, violence-related, and murder-related tweets.
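A country-level relationship of this kind can be summarised with a Pearson correlation and an ordinary least-squares line. The sketch below assumes a simple linear fit of tweets per 1000 on the number of murders; the function name and the example values are hypothetical and are not the study's data, and the study's exact model specification is not given here.

```python
from math import sqrt


def pearson_and_line(x, y):
    """Return (r, slope, intercept) for the least-squares line y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope, mean_y - slope * mean_x


# Invented values for four hypothetical countries (illustration only).
murders = [300, 2500, 12000, 30000]
crime_tweets_per_1000 = [9.0, 12.5, 16.0, 24.0]
r, slope, intercept = pearson_and_line(murders, crime_tweets_per_1000)
print(f"r = {r:.2f}, slope = {slope:.5f}, intercept = {intercept:.2f}")
```

The same function can be reused with murders per 100,000 people or a fear of crime index as the explanatory variable.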

Fig. 4: Observed correlations between the collected crime-related tweets and the crime and fear of crime at a country level.

Crime-related, violence-related, murder-related and property-crime-related tweets per 1000 in the 18 countries in Latin America against the yearly number of murders, the number of murders per 100,000 (murder rate), the fear of crime index and the percentage of people with strong fear.

Broadly speaking, one murder in Latin America is associated with 8.4 murder-related tweets, 13.7 violence-related tweets, and 32.4 crime-related tweets. This is, perhaps, as expected, since it is likely that murders are covered by national news media outlets who then post the event on their own social media, which then might be reproduced by involved users, so that countries with more murders are also expected to have more tweets related to crime.

Note that the population size of countries varies considerably and, therefore, we also compare the number of murders per 100,000 people to take into account the impact of the size of the country (Fig. 4). The correlation between the number of crime-related tweets, violence-related tweets, and murder-related tweets per 1000 tweets posted against the number of murders per 100,000 people shows that countries with a higher murder rate are also more likely to observe more tweets related to crime, violence, and murder.

Venezuela is the country with the highest fear of crime (although not the highest number of murders or murder rate) and it is also the country with the highest number of crime-related and violence-related tweets per 1000 (Fig. 4). Venezuela is also the country with the highest percentage of users with crime-related tweets and the country in which crime-related and violence-related tweets are least concentrated (Fig. 3).

Social media against reality in Mexico

Considering only the tweets from Mexico, it is possible to compare the number of posts related to different types of crime with what is actually suffered in the country.

According to the Mexican victimisation survey ENVIPE (Encuesta, 2016), for every murder there are roughly 34 crimes with a sexual component (including rape, rape attempts, harassment, exhibitionism); 917 property crimes (including violent and non-violent crimes in which property is stolen from the victim, such as car theft, robbery of a person, burglary and others), and 1391 crimes including all types of crime, but this is far from what is portrayed on social media. Although there is a temporal misalignment between the posts on social media and the victimisation survey, assuming that the number of crimes remains roughly constant, results show that in Mexico:

  • there are 1.44 crime-related tweets for every 100 crimes suffered, regardless of whether or not the crimes were violent or with a sexual component;

  • there are 0.21 property-crime-related tweets for every 100 property crimes suffered in the country;

  • there are 1.41 sexual-crime-related tweets for every 100 sexual crimes suffered in the country; and

  • there are 567.5 murder-related tweets for every 100 murders suffered in the country.

Although the number of crimes fluctuates, which makes the above quantifications imprecise, they do show the emphasis that social media places on each type of crime.

Assuming that tweets are a direct response to a specific type of crime, results show that a crime with a sexual component is tweeted 6.6 times more frequently than property crimes are posted. Murders are tweeted 401 times more frequently than sexual crimes (Fig. 5).
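These comparisons are ratios of the tweets-per-100-crimes rates listed above. The short sketch below recomputes them from the rounded rates quoted in the text; the dictionary and function names are hypothetical. The results (about 6.7 and 402) differ slightly from the reported 6.6 and 401, presumably because the reported values were computed from unrounded counts, which is an assumption here.

```python
# Tweets per 100 crimes suffered in Mexico, as quoted in the text (rounded).
TWEETS_PER_100_CRIMES = {
    "all crime": 1.44,
    "property": 0.21,
    "sexual": 1.41,
    "murder": 567.5,
}


def times_more_tweeted(crime_a, crime_b, rates=TWEETS_PER_100_CRIMES):
    """How many times more often crime_a is tweeted than crime_b, per crime suffered."""
    return rates[crime_a] / rates[crime_b]


print(round(times_more_tweeted("sexual", "property"), 1))  # 6.7
print(round(times_more_tweeted("murder", "sexual"), 1))    # 402.5
```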

Fig. 5: Observed frequency of different types of crime against crime-related tweets.

Comparing different types of crime suffered in Mexico against the frequency at which they are tweeted shows that social media has a strong bias towards sexual crimes and murders.

At city level, there are even fewer crime-related tweets

Of the 32 million tweets collected, only 2.68 million (8.3%) are assigned to a city, and of these only 19,912 are crime-related, which represents 7.4 of every 1000 tweets collected at a city level. The rate of crime-related tweets thus drops from 15.4 per 1000 to less than half when users actually share their location on social media. In Mexico, for instance, there are 10.68 tweets related to crime per 1000 at a national level, but the rate drops to 5.98 tweets per 1000 when the user shares their location. In Colombia, the number of crime-related tweets per 1000 drops from 24.64 at a national level to 5.13 at a city level, and in Venezuela it drops from 38.14 to 21.85. In every country, there is a considerable decrease in tweets related to crime per 1000 when posts are analysed at a city level, which means that users who share their location are less likely to post crime-related tweets.

There is a triple filter when the analysis is at a city level: only 1.5% of tweets are crime-related, only 8.3% of tweets are assigned to a city, and tweets with a location are less likely to be crime-related, so the combined effect is even stronger (as observed in Colombia). Thus, at the city level, almost no tweets related to crime are collected. In 31 cities, fewer than one crime-related tweet is published each day. There are some cities, for instance, Cochabamba in Bolivia and Arequipa in Peru, where the number of crime-related tweets detected over the 70 days is only 2 and 5, respectively.
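The size of the combined filter can be checked with the figures quoted in the text. The sketch below compares the observed number of city-level crime-related tweets with the number expected if located tweets mentioned crime as often as all tweets do. The variable names are hypothetical, and the total of 32 million is a rounded value, so the results are approximate.

```python
# Figures quoted in the text (the 32 million total is rounded).
TOTAL_TWEETS = 32_000_000
CITY_TWEETS = 2_680_000
CITY_CRIME_TWEETS = 19_912
NATIONAL_RATE_PER_1000 = 15.4


def per_thousand(part, whole):
    """Rate of `part` per 1000 units of `whole`."""
    return 1000 * part / whole


located_share = CITY_TWEETS / TOTAL_TWEETS
city_rate = per_thousand(CITY_CRIME_TWEETS, CITY_TWEETS)
# Expected count if located tweets were as likely to mention crime as all tweets.
expected_if_independent = CITY_TWEETS * NATIONAL_RATE_PER_1000 / 1000
retained = CITY_CRIME_TWEETS / expected_if_independent

print(f"share of tweets with a city: {located_share:.1%}")
print(f"city-level rate: {city_rate:.1f} per 1000")
print(f"observed / expected crime-related tweets: {retained:.2f}")

# National and city-level rates per 1000 quoted for three countries.
country_rates = {
    "Mexico": (10.68, 5.98),
    "Colombia": (24.64, 5.13),
    "Venezuela": (38.14, 21.85),
}
for country, (national, city) in country_rates.items():
    print(f"{country}: city rate is {city / national:.0%} of the national rate")
```

Under these figures, fewer than half of the crime-related tweets expected from the first two filters alone are actually observed, which is the third filter in numerical form.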

Crime-related tweets at a city level are very scarce, and in addition, there might be even more sources of bias. For the specific case of La Laguna, a metropolitan area in the Northern part of Mexico, with 1.2 million inhabitants, a media consortium was detected (including three different newspapers), which frequently shared crime-related tweets on their accounts (amounting to >95% of the total crime-related tweets from that city), and which also shared the location of their offices. Thus, a city could have a high number of crime-related tweets but mostly due to a few accounts, which publish their location. Since a similar situation does not happen in any other city and newspapers do not often share their location on every tweet, La Laguna was dropped from the statistical analysis to avoid having strong outliers (Fig. 6).

Fig. 6: Crime-related tweets per 1000 in the largest 23 metropolitan areas in Mexico.

The number of crime-related, violence, murder, property-crime, organised-crime and gun-related tweets per 1000.

Crime at a city level

Correlating the expressions of crime in social media with either the actual crime suffered or the fear of crime shows that there is practically no relationship between the number of crime-related tweets per 1000 and the suffered crime, property crime or perceived fear, whether measured at a local level, at county level or at province level (Fig. 7).

Fig. 7: Observed correlations between the collected crime-related tweets and the crime and fear of crime at a city level in Mexico.

There might be some apparent correlation between the victims of hard crime per 100,000 and the number of property-crime tweets per 1000 (Fig. 7); however, these are not the same type of crime. The number of murders and the number of murders per 100,000 have a small correlation with the number of murder-related tweets per 1000; the number of violence-related tweets per 1000 also has a small correlation with murders or hard crimes per 100,000; and the number of property-crime-related tweets has little correlation with the number of robberies of a person.

Although it would be possible to conduct a different statistical analysis, the scarcity of crime-related tweets at city level is critical. The city of Puebla, Mexico, for example, has 2.7 million inhabitants and more than 3000 daily crimes according to the victimisation survey ENVIPE (Encuesta, 2016). Although its Police Department and local government might try to use Twitter for security-related topics, there are roughly 2.2 crime-related tweets posted every day in Puebla (less than one tweet per million inhabitants), so even in the best-case scenario, fewer than 1 in 1000 crimes could be matched with a distinct tweet.
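The two bounds for Puebla follow directly from the quoted figures, as the short worked example below shows. The variable names are hypothetical, and 3000 is used as a lower bound for the number of daily crimes, so the matched share is an upper bound.

```python
# Figures for Puebla quoted in the text.
INHABITANTS = 2_700_000
DAILY_CRIMES = 3000          # "more than 3000", so a lower bound
DAILY_CRIME_TWEETS = 2.2

tweets_per_million = DAILY_CRIME_TWEETS / (INHABITANTS / 1_000_000)
# Best case: every tweet refers to a different crime.
max_matched_per_1000_crimes = 1000 * DAILY_CRIME_TWEETS / DAILY_CRIMES

print(f"{tweets_per_million:.2f} crime-related tweets per million inhabitants per day")
print(f"at most {max_matched_per_1000_crimes:.2f} of every 1000 crimes matched to a tweet")
```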

The number of property-crime-related tweets per 1000 is only loosely related to the hard crime rate observed in each city (where “hard crime” includes murder, kidnapping and missing persons). In 15 cities in Mexico, fewer than 5 property-crime-related tweets were detected during the 70 days of the data collection. Thus, at a city level, we are no longer looking at large amounts of data but depend only on a few users posting even fewer tweets.

Unfortunately, at a city level, social media posts offer little information about the crime suffered or the fear of crime. Hence, forecasting crime, detecting hotspot patterns for policing, measuring fear of crime, activism or public opinions seems almost impossible at a city level using tweets.

Conclusions and discussion

This work addresses aspects of quantitative social science by analysing crime and fear of crime to highlight specific behaviour. The study was carried out via a comparison between reality and events and expressions posted on social media. An accurate description of the situation with respect to crime in a country, or a city, is provided by official data and victimisation surveys, the latter of which also gives quantitative measurements regarding the fear of crime.

Less than 1% of social media is related to crime or its fear

Considering tweets to be associated with crime simply because they include a crime-related word has both advantages and disadvantages. On the one hand, since the list of words is long (so that almost every word that has any connotation or relation to crime is included), the method should pick up almost all crime-related posts. On the other hand, not all of the tweets flagged as crime-related are, in fact, related to crime. Manually reading 3000 crime-related tweets established that nearly one-third of the so-called crime-related tweets are clearly not related to crime. Even with a manual inspection of the tweets, it is in many cases very complicated to determine whether a tweet is related to a crime or not.
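The word-inclusion rule and its main weakness can be illustrated with a minimal sketch. The word list below is a tiny hypothetical sample, not the study's list, and matching on whole lowercase tokens is an assumption, since the exact matching rule is not described here.

```python
import re

# Hypothetical sample of crime-related words (illustration only; the
# study's actual list is much longer and is not reproduced here).
CRIME_WORDS = {"robo", "asalto", "homicidio", "secuestro", "crimen"}


def is_crime_related(text, words=CRIME_WORDS):
    """Flag a tweet if any of its lowercase tokens is in the word list."""
    tokens = re.findall(r"\w+", text.lower())
    return any(token in words for token in tokens)


def crime_tweets_per_1000(tweets):
    """Number of flagged tweets per 1000 tweets."""
    flagged = sum(is_crime_related(tweet) for tweet in tweets)
    return 1000 * flagged / len(tweets)


sample = [
    "Hubo un asalto en el centro",  # about a crime
    "Ese gol fue un robo",          # football complaint, still flagged
    "Buenos días a todos",          # not flagged
]
for tweet in sample:
    print(is_crime_related(tweet), "-", tweet)
```

The second tweet shows how a crime-related word used figuratively produces a false positive, the kind of case that the manual reading of 3000 tweets uncovered.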

Without performing a sentiment analysis on individual tweets (Pak and Paroubek, 2010; Vadicamo, 2017), and considering them “relevant” or related to a specific type of crime simply because they include specific words, little more than 15 tweets per 1000 were considered to be crime-related. Most of the tweets collected (on average 984 per 1000) were not related to crime, but the rate of crime-related tweets differs between countries: in Venezuela, 38 tweets per 1000 are crime-related, but in Bolivia, fewer than 8 tweets per 1000 are crime-related.

Crime-related tweets are (and will likely be) scarce

In Latin America, as few as 15 tweets per 1000 are crime-related. There are three possible changes that might lead to more crime-related tweets in the future: the network gains more users; the current users tweet more frequently; or the same number of users, with a similar number of daily published posts, tweet crime-related words more frequently. However, no evidence currently suggests that any of these three changes will actually happen in the near future. The number of Twitter monthly active users has been almost constant for the past few years (Footnote 9). Similarly, the platform today has nearly 500 million tweets per day, but this number has remained constant since 2014 (Footnote 10). Therefore, unless the current users, posting tweets with the same frequency as before, decide to post more crime-related tweets (and displace other topics as a result), the number of crime-related tweets is not expected to grow significantly. This last possibility is, among the three outlined, the one that is most likely to happen. In fact, “hot” topics on social media tend to change over time, as we have recently seen with issues such as migration and opinion manipulation, which gained huge momentum. As such, the interest of online crowds in crime-related issues could rise, or exogenous factors (e.g., a more social communication strategy by law enforcement agencies) could result in a significantly greater number of crime-related tweets.

Violence in social media

Very similar to what occurs with the traditional media, violence is the most frequent crime-related tweet, although it is not the most frequent type of crime suffered in the region. Traditional media shows a strong bias towards violent crime and the evidence here is that social media displays a similar bias towards violent crime, possibly even more pronounced, with tweets not reflecting the overall crime levels. For instance, in Mexico, a murder is 2653 times more likely to be mentioned on Twitter than a property crime is.

Involved users across different countries

A large part of the observed crime-related tweets is from users who might be termed “involved users”, who are likely to tweet frequently about crime and fear of crime. Results show that more murders in a country and a higher murder rate are associated with a higher number of crime-related, violence-related and murder-related tweets per 1000 (Fig. 4). Although no causal mechanism by which a crime encourages users to tweet about crime is detected here, comparing the ratio between the weekly number of crime-related tweets and the weekly murder rate shows that in Uruguay and Chile each murder is associated with more than 60 murder-related tweets, whereas the number drops to fewer than 3 tweets in other countries. A higher involvement in social media, measured as the ratio between the number of murder-related tweets and the murder rate, was found in countries with a higher gross domestic product (GDP) per capita.

Perhaps as expected, there are more crime-related tweets in countries with a higher murder rate. However, comparing again the ratio between the weekly number of crime-related tweets and the weekly number of murders, countries with a higher number of murders (not the murder rate) also have more crime-related tweets, meaning that population size matters. Consider, for instance, a country with the population size of Mexico (approximately 130 million people) and a country with the size of Uruguay (approximately 3.3 million people), then even if both countries suffered the same murder rate, indicating the same risk for their population, the larger country would experience 40 times more murders simply due to the difference in population size and therefore, would have more crime-related tweets as a result.

In terms of crime, social media is more useful at national level than at city level

The number of crime-related tweets per 1000 was shown to be correlated with the number of murders in the country and with the murder rate. At a national level, a positive finding of our study is that the most significant correlation occurs between the number of crime-related tweets and measurements of the fear of crime. Crime-related tweets posted in one city are not necessarily about a specific crime in that city, but could also be about crime in neighbouring cities or the rest of the country. For instance, users in Mexico City might tweet about a murder in Tijuana, even if it is 2300 kilometres away. The closer a person lives to a murder, the more likely it is that they will tweet about it (Kounadi et al., 2015), but tweets posted in a city can refer to events in that city as well as to events outside it, meaning that at city level tweets might be less useful.

Unfortunately, regarding city level analyses, social media posts seem to offer little information about the crime suffered or the fear of crime. Hence, using tweets to forecast crime, to detect hotspot patterns for policing, or to measure fear of crime, involvement or public opinion seems difficult at a city level. In summary, our results suggest that, regarding crime and fear of crime, only national conclusions are significant.

This finding is consistent with recent studies that highlight the scarcity of fine-grained geospatial information in social media data (Middleton et al., 2013). However, the situation might change in the future, for instance, due to advances in tasks such as geoparsing, which allow extracting geographic information also from social media posts that were not originally geotagged (Avvenuti et al., 2018). Moreover, our findings do not imply that social media data is useless for city level analyses regarding other issues. In fact, it already proved valuable for improving a number of city level tasks (Avvenuti et al., 2018; Cranshaw et al., 2012; Middleton et al., 2013).

Expressions of fear in social media

From a crime science perspective, the content of the tweets, and not just the location, as has been used previously (Malleson and Andresen, 2015), might still provide a valuable source of information when the focus is on the fear of crime at a national level. Traditional ways of measuring fear of crime rely strongly on costly victimisation surveys, which often have a delay of months or even years between when the data is collected and when the results are published. In addition, surveys might not be comparable between different countries, since different questions may be asked. However, by systematically processing and classifying tweets, we envision the possibility of creating a proxy for the fear of crime within the population of a country. Results of such an endeavour can be both timely (with almost no delay after the tweets are posted on social media) and economically efficient (at a negligible cost compared to the costly surveys), as already demonstrated in other domains, e.g., estimation of the impact of disasters (Avvenuti et al., 2016). This social media expression of fear of crime could be valuable, for instance, to detect people’s reaction to a highly reported crime, or to detect the speed at which that highly reported crime is mentioned by the users.

However, care should be taken with regard to possible forms of manipulation of crime-related social media data. This is not a specific limitation of our study, but rather a limitation of all analyses that are based on social media data. Among the most common forms of manipulation in social media are those related to the activity of automated accounts, so-called social bots (Cresci, 2019; Mazza et al., 2019), and those related to the spread of false and misleading information, e.g., fake news (Docan-Morgan, 2019). Studies of online manipulation and fake content have, however, demonstrated that the majority of malicious activities occur in discussions related to politics (Bessi and Ferrara, 2016) and finance (Cresci et al., 2018, 2019), that is, in those scenarios characterised by the strongest political or economic interests. To date and to the best of our knowledge, no study has reported orchestrated manipulations of social media data related to crime or fear of crime.