Machine learning partners in criminal networks

Recent research has shown that criminal networks have complex organizational structures, but whether this can be used to predict static and dynamic properties of criminal networks remains little explored. Here, by combining graph representation learning and machine learning methods, we show that structural properties of political corruption, police intelligence, and money laundering networks can be used to recover missing criminal partnerships, distinguish among different types of criminal and legal associations, as well as predict the total amount of money exchanged among criminal agents, all with outstanding accuracy. We also show that our approach can anticipate future criminal associations during the dynamic growth of corruption networks with significant accuracy. Thus, similar to evidence found at crime scenes, we conclude that structural patterns of criminal networks carry crucial information about illegal activities, which allows machine learning methods to predict missing information and even anticipate future criminal behavior.

www.nature.com/scientificreports/
In addition to being useful in classification tasks, we have also verified that the representations obtained from node2vec predict the total amount of money exchanged among agents of a criminal financial network with excellent accuracy. Finally, our investigation shows that one can predict future criminal partners during the growth of political corruption networks. Our research thus indicates that the underlying patterns of criminal networks carry crucial information about the associations among criminals, allowing us to recover possible missing links and properties of these connections, and even to anticipate future criminal associations. Furthermore, the impressive accuracy and the simplicity of deploying trained machine learning methods allow us to conjecture that our approach is likely to be very helpful in future police intelligence operations.

Datasets
Our results are based on four datasets associated with different types of criminal networks. Two of these criminal networks are related to political corruption scandals in Spain and Brazil. The Brazilian data were first used in Ref. 13 and the Spanish data were obtained from Ref. 14. In both networks, nodes represent people involved in political scandals and connections among them indicate individuals engaged at least once in the same corruption case. The Spanish network has 2,695 nodes and 27,545 edges, while the Brazilian network has 404 nodes and 3,549 edges. In addition, we also have information about the growth dynamics of these networks because we know the date of each corruption scandal. As a result, we can reconstruct the growth of these corruption networks by considering corruption scandals occurring up to a given year. The 437 Spanish scandals used in our study occurred between 1989 and 2018, and the 65 Brazilian corruption cases occurred between 1987 and 2014.
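The construction described above, in which every scandal links all of its participants and the network grows as later scandals are included, can be sketched as follows. This is only an illustration under stated assumptions: the `(year, people)` input format, the function name `build_network`, and the toy names are hypothetical and are not taken from the datasets of Refs. 13 and 14.

```python
from itertools import combinations


def build_network(scandals, up_to_year=None):
    """Return (nodes, edges) of a co-participation network.

    scandals: iterable of (year, [person, ...]) pairs (hypothetical format).
    up_to_year: if given, only scandals with year <= up_to_year are used.
    """
    nodes, edges = set(), set()
    for year, people in scandals:
        if up_to_year is not None and year > up_to_year:
            continue
        nodes.update(people)
        # Every pair of people in the same scandal gets one undirected edge,
        # so each scandal contributes a complete subgraph.
        for a, b in combinations(sorted(set(people)), 2):
            edges.add((a, b))
    return nodes, edges


# Toy data invented for illustration only.
toy_scandals = [
    (1990, ["ana", "bruno", "carla"]),
    (1995, ["carla", "davi"]),
    (2001, ["ana", "davi", "eva"]),
]

nodes_1995, edges_1995 = build_network(toy_scandals, up_to_year=1995)
nodes_all, edges_all = build_network(toy_scandals)
print(len(nodes_1995), len(edges_1995))
print(len(nodes_all), len(edges_all))
```

In this toy example, "carla" and "davi" play the role of recidivist agents: they are the only links between otherwise separate scandals.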
Our third criminal network was obtained from Ref. 15 and comprises records of criminal investigations conducted by the Brazilian Federal Police. People involved in this network are criminals or people suspected of illegal activities related to federal crimes (drugs and arms trafficking, organized bank robbery, environmental crimes, crimes against elections and financial systems, money laundering, among others), and connections among them indicate individuals involved in the same police investigation or people with personal relationships uncovered during the investigations. This criminal intelligence network has 23,666 nodes and 35,930 edges. For the main component of this network (8,894 nodes and 17,827 edges), we also have information about the type of association between individuals collected by the Brazilian Federal Police. This information is original to our work and classifies the edges among individuals into three types: criminal, mixed, and non-criminal. Criminal edges connect people who are related solely for unlawful purposes; non-criminal edges connect people who do not have a criminal association and may include family or friendship ties; finally, mixed connections represent associations that are both criminal and personal (for instance, two brothers involved in a criminal investigation).
The last dataset used is also original to our study and it is related to a money-laundering investigation conducted by the Brazilian Federal Police from 2008 to 2014. The raw data correspond to bank transactions related to the misappropriation of federal public funds. After being aggregated, this information yields a criminal financial network where nodes represent people or companies, and the connections indicate financial transactions among them regardless of the cash flow direction and amount exchanged.
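One way to picture the aggregation step is the short sketch below, which collapses directed transfers into undirected edges and keeps the total amount exchanged per pair. The `(sender, receiver, amount)` record format, the function name, and the decision to skip self-transfers are assumptions made for illustration; the original preprocessing is not described in this level of detail.

```python
from collections import defaultdict


def aggregate_transactions(transactions):
    """Collapse directed transfers into undirected edges with total amounts.

    transactions: iterable of (sender, receiver, amount) tuples
    (hypothetical record format).
    """
    totals = defaultdict(float)
    for sender, receiver, amount in transactions:
        if sender == receiver:
            continue  # assumption: self-transfers carry no network information
        # Sorting the endpoints discards the direction of the cash flow.
        key = tuple(sorted((sender, receiver)))
        totals[key] += amount
    return dict(totals)


# Toy records invented for illustration only.
toy_transactions = [
    ("A", "B", 100.0),
    ("B", "A", 50.0),
    ("A", "C", 25.0),
]
toy_edges = aggregate_transactions(toy_transactions)
print(toy_edges)
```

The keys of the resulting dictionary give the unweighted, undirected edge list, while the values are the totals that a regression task could later try to predict.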

Results
We start our investigation by asking whether one can predict criminal partnerships in a static scenario using only structural information of criminal networks. To do so, we consider the final stages (all political scandals) of the Spanish and Brazilian corruption networks and the criminal intelligence network gathered by the Brazilian Federal Police. Figure 1A-C depicts visualizations of these three networks. We first randomly remove 10% of the edges of these networks and sample the same number of false connections to create a test set of true and false links. We then use the remaining 90% of the edges of these three networks as training sets to fit a logistic classifier 33 that predicts whether the links in the test set are true or false. For training this simple statistical learning method, we generate vector representations of nodes in the training sets using the node2vec method 32. This is one of the most popular network embedding methods and consists of finding vector representations that maximize the probability of nodes co-occurring in sequences of biased random walks with fixed lengths. In our analysis, we have fixed the embedding dimension to 256, the walk length to 5, the number of walks per node to 10, and the random walk bias parameters (breadth-first or depth-first) to 1. These choices represent the default setting and make the embedding algorithm similar to DeepWalk 34. Following Ref. 32, we create vector representations for network edges by combining the vector representations of nodes with four binary operators: average, Hadamard, and L1 and L2 norms. Finally, we associate these vector representations with true edges in the training sets and the same number of randomly sampled false connections.
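The four binary operators and the sampling of false connections can be illustrated with the minimal sketch below. The operator definitions follow the element-wise forms given in Ref. 32 (average, Hadamard product, absolute difference, and squared difference); the function names, the seed handling, and the toy vectors are hypothetical, and the node embeddings themselves are assumed to come from a separate node2vec run.

```python
import random


def edge_vector(u_vec, v_vec, operator):
    """Combine two node embeddings into one edge embedding (element-wise)."""
    pairs = list(zip(u_vec, v_vec))
    if operator == "average":
        return [(a + b) / 2 for a, b in pairs]
    if operator == "hadamard":
        return [a * b for a, b in pairs]
    if operator == "l1":
        return [abs(a - b) for a, b in pairs]
    if operator == "l2":
        return [(a - b) ** 2 for a, b in pairs]
    raise ValueError(f"unknown operator: {operator}")


def sample_false_edges(nodes, true_edges, n, seed=0):
    """Sample n distinct node pairs that are not connected in the network."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    existing = {frozenset(e) for e in true_edges}
    max_false = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    if n > max_false:
        raise ValueError("not enough unconnected pairs to sample from")
    false_edges = set()
    while len(false_edges) < n:
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in existing:
            false_edges.add(pair)
    return sorted(tuple(sorted(p)) for p in false_edges)


# Toy two-dimensional embeddings (the study uses 256 dimensions).
u, v = [1.0, 2.0], [3.0, -2.0]
for name in ("average", "hadamard", "l1", "l2"):
    print(name, edge_vector(u, v, name))

false_edges = sample_false_edges(["a", "b", "c", "d"], [("a", "b"), ("c", "d")], 2)
print(false_edges)
```

All four operators are symmetric in their arguments, which suits undirected networks: the edge vector does not depend on the order in which the two endpoints are given.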
We thus train the logistic classifiers using these vector representations of true and false edges from the training sets and estimate the accuracy of our approach by calculating the average fraction of correct classifications in the test sets over ten realizations of the train-test split and embedding processes. Figure 1D shows these accuracies for the three networks and the four binary operators. The accuracy of the logistic classifiers significantly outperforms the baseline accuracy (50%) in all cases. Furthermore, in line with the benchmark results presented in Ref. 32, we find the Hadamard operator yields the best performance across our three criminal networks. These best accuracies are remarkably high (≈98% for the Spanish corruption network, ≈96% for the Brazilian corruption network, and ≈87% for the Brazilian criminal intelligence network), which in turn indicates that structural properties of these networks carry important predictive information about network connections that are well captured by the edge embeddings produced by node2vec.
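To make the classification step concrete, the sketch below fits a logistic classifier to labeled edge vectors and reports the fraction of correct predictions on held-out edges. It is a plain gradient-descent implementation written for illustration, not the software used in the study; the learning rate, the number of epochs, and the two-dimensional toy edge vectors are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, X):
    """Label an edge as true (1) when its predicted probability is >= 0.5."""
    w, b = model
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]


def accuracy(y_true, y_pred):
    """Fraction of correct classifications."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy edge vectors: label 1 = true edge, label 0 = sampled false edge.
X_train = [[0.9, 0.8], [0.7, 0.9], [0.8, 0.6],
           [-0.5, -0.7], [-0.9, -0.4], [-0.6, -0.8]]
y_train = [1, 1, 1, 0, 0, 0]
X_test, y_test = [[0.6, 0.7], [-0.7, -0.5]], [1, 0]

model = fit_logistic(X_train, y_train)
test_accuracy = accuracy(y_test, predict(model, X_test))
print(test_accuracy)
```

Because the test sets contain equal numbers of true and false edges, a classifier that guesses at random would reach an accuracy of about 0.5, which is the baseline quoted in the text.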
In Fig. S1, we have compared the performance of node2vec with the LINE 35 and Mercator 36 embedding methods. The general accuracies of these other approaches also outperform the baseline accuracy, but are always lower than the scores obtained with node2vec. We have also verified how the performance of our approach depends on the amount of training data: Figure 1E shows the accuracies as a function of the fraction of edges in the training sets used for creating the embedding representations for the three networks. We note that the accuracy in the corruption networks approaches its maximum value much faster than the accuracy in the Brazilian criminal intelligence network. For example, we observe practically no change in the scores of corruption networks after considering ≈60% of edges in the training sets, while the score in the criminal intelligence network monotonically increases with the fraction of edges used in the embedding process. These results indicate that the structure of corruption networks is more redundant than the one observed for the criminal intelligence network. Indeed, corruption networks are formed by a set of complete graphs representing people involved in political scandals that are in turn connected with each other by the recidivism of a small number of agents 14. In contrast, criminal intelligence networks can have more complex connectivity patterns that are uncovered by police investigations 15.
Figure 1. [...] In corruption networks, nodes represent people involved in corruption scandals, and connections indicate people participating in the same corruption case. In its turn, nodes in the criminal intelligence network represent people investigated by the Brazilian Federal Police, and an edge between two individuals indicates some co-participation (unlawful or lawful) uncovered by police investigations. (D) Accuracy of logistic classifiers trained for predicting missing links with node2vec representations of nodes and different binary operators. The bars stand for the average accuracy estimated from test sets over ten realizations of the embedding and training processes (error bars represent one standard deviation). The test sets are generated by randomly removing 10% of network edges and sampling the same number of false connections. The horizontal dashed lines represent the baseline accuracy (0.5). (E) Accuracy of logistic classifiers as a function of the fraction of nodes in the training set for each criminal network. The markers represent the average accuracy estimated from test sets over ten realizations of the embedding and training processes with the Hadamard operator (shaded regions stand for one standard deviation band).
In another application, now focusing on the giant component of the Brazilian criminal intelligence network, we have asked whether the structural properties of this network can be used to determine the type of association among its agents. Figure 2A shows a visualization of the giant component of this network where the three types of edges (criminal, mixed, and non-criminal) are depicted in different colors (red, blue, and green, respectively). This time our task is thus to classify the edge types, and to do so, we have again used node2vec to generate vector representations of edges by combining the node embeddings with the same four binary operators used in the previous applications. After obtaining the vector representations, we separate (stratified by the three classes of [...] 37 to balance the class distribution in the training set.
We have thus fitted a k-nearest neighbors (kNN) classifier 33 to the training data and estimated the average accuracy of the approach in the test set over ten realizations of the embedding process for each binary operator. Figure 2B shows these scores in comparison with two dummy classifiers that make predictions based on the relative frequency of each edge type (gray continuous line) and the most frequent edge type (black dashed line). We observe that the accuracy obtained from each binary operator is significantly higher than that of the two baseline classifiers. Again, the Hadamard operator displays the highest accuracy (74%), followed closely by the average operator. Figure 2C presents the confusion matrix of the classification task estimated from the test set using the Hadamard operator (values represent an average over ten realizations of the embedding process). Identifying mixed relationships is more challenging for the kNN algorithm, as it correctly classifies this edge type in only 55% of cases. In contrast, criminal and non-criminal edges are correctly classified 81% and 77% of the time, respectively. It is also worth noticing that the algorithm misclassifies mixed relations as criminal edges more frequently than as non-criminal ones, which can be regarded as a desirable property when considering that this type of relationship is always related to a possible crime.
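The edge-type classification and its comparison against a dummy baseline can be sketched as below. This is a simplified illustration under stated assumptions: the kNN rule uses Euclidean distance and majority voting, only the most-frequent-class baseline is shown, class balancing is omitted, and the two-dimensional edge vectors and their labels are invented.

```python
import math
from collections import Counter


def knn_classify(train_X, train_y, x, k=1):
    """Return the majority label among the k nearest training vectors."""
    neighbors = sorted(zip((math.dist(x, xi) for xi in train_X), train_y))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def confusion_counts(y_true, y_pred):
    """Nested counts: confusion[true_label][predicted_label]."""
    counts = {}
    for t, p in zip(y_true, y_pred):
        counts.setdefault(t, Counter())[p] += 1
    return counts


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy edge vectors and edge types, invented for illustration only.
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [1.0, 1.0], [0.9, 1.1], [2.0, 0.0], [2.1, 0.1]]
train_y = ["criminal", "criminal", "criminal",
           "mixed", "mixed", "non-criminal", "non-criminal"]
test_X = [[0.05, 0.1], [1.0, 0.9], [1.9, 0.1], [1.6, 0.4]]
test_y = ["criminal", "mixed", "non-criminal", "mixed"]

predictions = [knn_classify(train_X, train_y, x, k=1) for x in test_X]
knn_accuracy = accuracy(test_y, predictions)

# Dummy baseline: always predict the most frequent training label.
most_frequent = Counter(train_y).most_common(1)[0][0]
baseline_accuracy = accuracy(test_y, [most_frequent] * len(test_y))

confusion = confusion_counts(test_y, predictions)
print(knn_accuracy, baseline_accuracy)
print({label: dict(row) for label, row in confusion.items()})
```

Dividing each row of the confusion counts by its total gives the per-class rates of the kind reported in Fig. 2C.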
We have explored how the number of neighbors (k) in the kNN classifier affects the accuracy in determining the type of association. Figure 2D shows the average accuracy estimated from the test set over ten realizations of the embedding and training processes as a function of the number of neighbors. We observe that the highest scores are obtained for a small number of neighbors and that the accuracy monotonically decreases with the number of neighbors. The results presented in Fig. 2B,C are for k = 1 as this value yields the highest accuracy. In addition, we have also verified how the accuracy depends on the fraction of edges used for training the kNN model. To do so, we consider a variable fraction of edges (X%) for training the kNN model and use the remaining edges [(1 − X)%] as the test set. Figure 2E shows that the average accuracy calculated from the test set monotonically increases with the fraction of edges in the training set. However, the accuracy changes are much more intense for lower than higher fractions of edges used for training the learning method.
In our third application, we have tried to predict the amount of money exchanged among agents in the criminal financial network only using the structural information of this network. Figure 3A depicts a visualization of this network where the edge thicknesses are proportional to the logarithm of the amount of money exchanged between pairs of nodes. Similarly to what we have done before, we have used node2vec to create vector representations of all edges in this network with the same four binary operators. However, we do not include any information about the amount of money, such that only the existence or not of (undirected and unweighted) links among nodes is used during the embedding process.
After obtaining the vector representations, we have associated them with the logarithm of the amount of money for each network edge and split the resulting dataset into training (90%) and test (10%) sets.
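As a rough illustration of this regression setup, the sketch below takes the logarithm of the amounts, predicts the value for a new edge with a kNN regressor, and scores predictions with the coefficient of determination that is discussed in the next paragraph. The base-10 logarithm, the Euclidean distance, the unweighted neighbor average, and all toy numbers are assumptions, since the text does not specify them.

```python
import math


def knn_regress(train_X, train_y, x, k):
    """Predict the mean target of the k nearest training vectors."""
    nearest = sorted(zip((math.dist(x, xi) for xi in train_X), train_y))[:k]
    return sum(target for _, target in nearest) / k


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy amounts invented for illustration; targets are their base-10 logarithms
# (the base is an assumption).
amounts = [100.0, 1000.0, 10000.0, 100000.0]
train_y = [math.log10(a) for a in amounts]
train_X = [[0.0], [1.0], [2.0], [3.0]]  # one-dimensional stand-in edge vectors

prediction = knn_regress(train_X, train_y, [1.1], k=2)
print(prediction)

# A regressor that always predicts the mean of the observed values scores 0.
observed = [1.0, 2.0, 3.0]
print(r2_score(observed, observed), r2_score(observed, [2.0, 2.0, 2.0]))
```

An R² of 1 corresponds to perfect predictions, while a model that always returns the mean of the observed values scores 0, which is why constant predictors serve as natural baselines.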
We thus train kNN regressors to predict the logarithm of the amount of money and estimate the performance of our approach by calculating the coefficient of determination (R² score) between the predicted and actual values in the test set. We further average this quantity over ten realizations of the embedding and training processes. Figure 3B shows the average R² score obtained for each binary operator in comparison with two baseline regressors that always predict the average (black dashed line) and median (gray continuous line) of the training set values. The kNN models perform much better than the baselines and yield R² scores around 0.6 for all binary operators, but again the Hadamard operator displays the highest performance (≈0.64). Figure 3C illustrates the typical association between the predicted and observed values in the test set obtained with the Hadamard operator. We have also investigated the roles of the number of neighbors (k) and the fraction of edges in the training set (X%) on R² scores obtained from the test sets [(1 − X)%] with the Hadamard operator, as shown in Fig. 3D,E. We observe that k = 6 leads to models with the highest performance, and indeed, we have used this value for the results in Fig. 3B,C. For the fraction of edges in the training set, we note that the R² score saturates approximately after considering more than 50% of edges. Although there is certainly room for improving these scores, these results show that our approach works well not only in classification but also in regression tasks.
Finally, in our last application, we have considered the more challenging problem of predicting future criminal partnerships using the structure of criminal networks. We focus on the two corruption networks because we have the network growth dynamics only for these cases.
As we have already mentioned, these criminal networks grow by the inclusion of novel corruption scandals containing first-time offenders and recidivist criminals, with the latter being responsible for creating bonds between different corruption scandals. To approach this problem, we consider scandals occurring up to a given year Y to build the criminal network G_Y and use node2vec for creating vector representations for all nodes. We then use these node embeddings to produce vector representations for all network edges and the same number of randomly sampled false connections with the four binary operators. Considering this information as the training set, we train a logistic classifier to distinguish between true and false links. Next, we analyze all corruption scandals occurring after the year Y and collect all connections among nodes already present in G_Y. These connections represent future criminal partnerships among agents in G_Y. We consider the node embeddings obtained from G_Y to create vector representations for these true future connections and for the same number of randomly sampled false links that do not occur in the future of G_Y, defining our test set. Finally, we apply the trained logistic classifier to determine whether the connections in the test set are true or false and estimate the average accuracy of our approach over ten realizations of the entire process. Note that no information about scandals occurring after the year Y is used to create the vector representations of edges in the test set or to train the logistic model.
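The construction of the training network and of the future partnerships used for testing can be sketched as follows. The `(year, people)` format, the function names, and the toy scandals are hypothetical. The sketch also assumes that pairs already connected in the network up to year Y are not counted again as future partnerships, which the text does not state explicitly.

```python
from itertools import combinations


def edges_from(scandals):
    """All undirected co-participation edges implied by a list of scandals."""
    edges = set()
    for _, people in scandals:
        edges.update(frozenset(pair) for pair in combinations(set(people), 2))
    return edges


def temporal_split(scandals, year):
    """Return (nodes, edges) of the network up to `year` and its future edges."""
    past = [s for s in scandals if s[0] <= year]
    future = [s for s in scandals if s[0] > year]
    nodes_y = {person for _, people in past for person in people}
    edges_y = edges_from(past)
    # Keep only future edges whose two endpoints already exist in the past
    # network; already-existing edges are excluded (assumption).
    future_edges = {
        e for e in edges_from(future) if e <= nodes_y and e not in edges_y
    }
    return nodes_y, edges_y, future_edges


# Toy data invented for illustration only.
toy_scandals = [
    (1990, ["ana", "bruno", "carla"]),
    (1995, ["carla", "davi"]),
    (2001, ["ana", "davi", "eva"]),
    (2005, ["bruno", "carla", "davi"]),
]

nodes_y, edges_y, future_edges = temporal_split(toy_scandals, 1995)
print(sorted(tuple(sorted(e)) for e in future_edges))
```

Connections involving "eva" are dropped because she only enters the network after 1995, so no embedding would be available for her when the classifier is applied.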
The central panel of Fig. 4 shows the average accuracy in the test sets when considering different threshold years (Y) for both the Spanish (red circles) and Brazilian (blue squares) corruption networks. The insets indicated by arrows display visualizations of G_Y for a few years, highlighting future criminal partnerships by gray edges. These insets further show the confusion matrix of the classification process obtained from the test sets. The results in this figure use the Hadamard operator for the Spanish network and the average operator for the Brazilian network because these choices yield the highest average accuracies (see Figs. S2 and S3 for a comparison among the four binary operators and for results obtained with kNN classifiers). We observe that the logistic classifiers yield accuracies higher than 0.8 in most years of the Spanish corruption network, significantly outperforming the baseline score (0.5). For the Brazilian corruption network, the classification scores do not differ from the baseline accuracy for years before 2003. After this year, the scores fluctuate around ≈0.65 and significantly outperform the baseline accuracy. Taken together, these results demonstrate that it is possible to predict future criminal partners using only structural information of criminal networks with good accuracy. Despite that, the accuracies obtained here are lower than those obtained in our static scenario where edges are removed and then recovered in the final stages of these corruption networks (Fig. 1D). Thus, link prediction in time-varying networks is indeed more challenging, and results obtained in static scenarios may not generalize well to time-dependent settings.

Discussion
We have demonstrated how structural properties of criminal networks and machine learning methods can be used to predict links and link features among actors engaged in nefarious activities. Our research has been carried out using criminal networks associated with political corruption, police intelligence, and financial transactions. In particular, we have shown that simple logistic classifiers trained with embedded representations obtained from node2vec are capable of predicting criminal partnerships with excellent accuracy in static scenarios where a fraction of network edges is removed and then recovered. Beyond predicting whether a link exists or not, we have also shown that k-nearest neighbor classifiers trained with vector representations obtained from node2vec correctly distinguish between criminal, mixed, and non-criminal relationships in approximately three out of four connections in a police intelligence network. Furthermore, the same embedding approach combined with k-nearest neighbor regressors predicts the total amount of money exchanged among agents of a criminal financial network with very good accuracy. Finally, we have shown that structural properties encoded by node2vec and learned by simple logistic models can predict future criminal partnerships during the growth process of corruption networks.
Our work, however, does not go without its limitations. One is undoubtedly the quality of the information used to create criminal networks. Despite the efforts to make such information trustworthy, we must remember that these data come from police investigations of illegal and hidden activities, such that missing relationships or noise effects are likely to be present and affect the performance of our machine learning methods. This issue can also partially explain the lower performance we have observed when predicting future criminal associations.
Unfortunately, and as also occurs in many other empirical works with social systems, noisy data and missing information are more a rule than an exception. Another limitation is the lack of straightforward interpretations of machine learning methods and the consequent difficulty in deriving causal relationships from these models 38-40. Fortunately, there is a growing consensus that, in addition to delivering high prediction accuracy, machine learning methods must also be capable of producing knowledge from data, a domain that is referred to as "interpretable machine learning" and that is experiencing rapid developments 41, particularly in the context of graph representation learning 42,43.
Despite these limitations, our research strongly corroborates the fact that partnerships among criminals are far from being driven by random circumstances. Indeed, our results indicate that similar to evidence found at crime scenes, criminal associations exhibit patterns and carry crucial information that can be learned by machine learning methods and used to predict missing information or even anticipate the future behavior of agents in criminal networks. Machine learning methods can take vector representations of suspected agents and estimate probabilities for the existence of connections among them and whether they are likely to be criminal or not.
It is also worth remarking that we are witnessing a recent surge in research on graph representation learning which in turn yields a large number of techniques for generating effective vector representations for nodes, edges, and entire graphs 28-31. These methods can be roughly classified into two categories: traditional graph embedding methods and graph neural networks 44. The methods we have used are included in the first category, where the vector representations are obtained by optimizing some notion of proximity among nodes of the graph. On the other hand, graph neural networks were proposed even more recently (particularly graph convolutional networks) and belong to the class of deep learning models, where vector representations are obtained by aggregating node neighbors' representations and optimizing loss functions related to specific learning tasks. In addition to being task-specific, graph neural networks can generalize to unseen nodes and explicitly consider node and edge features. Thus, despite the excellent accuracy we have obtained with node2vec, exploring other graph representation methods such as graph convolutional networks seems a promising possibility that future research may address.
Regardless of being traditional or based on graph neural networks, all these methods can be easily deployed in practical applications involving police intelligence operations, making them potentially useful for helping, guiding, and optimizing police and judicial inquiries.

Data availability
Datasets describing the corruption networks and the police intelligence network are freely available on the internet (see Refs. 13-15). The dataset for the criminal financial network is available from the corresponding authors upon request.