Short text classification approach to identify child sexual exploitation material

Producing or sharing Child Sexual Exploitation Material (CSEM) is a severe crime that Law Enforcement Agencies (LEAs) fight daily. When an LEA seizes a computer from a potential producer or consumer of CSEM, it analyzes the suspect's storage devices looking for evidence. Manual inspection of CSEM is time-consuming given the limited time available for Spanish police to use a search warrant. Our approach to speeding up the identification of CSEM-related files is to analyze only the file names and their absolute paths rather than their content. The main challenge lies in handling short and sparse texts that are deliberately distorted by file owners using obfuscated words and user-defined naming patterns. We present two approaches to CSEM identification. The first employs two independent classifiers, one for the file name and the other for the file path, and combines their outputs. The second uses only the file name classifier, iterating over the components of the absolute path. Both operate at the character n-gram level, and novel binary and orthographic features are presented to enrich the text representation. We benchmarked six classification models based on machine learning and convolutional neural networks. The proposed classifier achieves an F1 score of 0.988, which makes it a promising tool for LEAs.


Introduction
In 2017, the Council of the European Union (EU) prioritized cybercrimes related to Child Sexual Abuse (CSA), considering them as the most serious crimes between 2018 and 2021 [1]. According to the European Police Office, Child Sexual Exploitation Material (CSEM) is defined as sexual abuse of a person under 18 years old and producing images or videos of the abuse and distributing such content online [2].
Darknets, such as The Onion Router (Tor) [3,4,5] and FreeNet [6], and Peer-to-Peer (P2P) networks, like eDonkey [7,8], are environments where the interchange of CSEM seems to proliferate, thanks to the high level of privacy and anonymity provided to their users. These characteristics allow pedophiles to easily share CSEM away from the monitoring of Law Enforcement Agencies (LEAs). It is worth mentioning that during the COVID-19 outbreak, Interpol reported a significant increase in the exchange of CSA material in P2P and Darknet networks as well as in online gaming and messaging applications [9].
CSEM producers and consumers might save this content on their local machines, at least temporarily. When an LEA inspects a home to analyze a suspect's computers, a police agent reviews the files on the investigated hard drive, trying to determine whether the suspect has stored CSEM on the computer [10]. This process needs to be accomplished in a limited time and as accurately as possible [11]. This work aims to build a File Classifier (FC) that decides whether a given file is related to CSEM according to its name and absolute path. The FC does not analyze file content, which is handled by other modules that are out of this paper's scope. Hence, the FC acts as a preliminary filter in a CSEM detection pipeline.
Building an automatic FC is a challenging task for several reasons. Firstly, a binary supervised algorithm requires training samples of Non-CSEM and CSEM files. However, there are no publicly available datasets of the latter class, and crawling samples from a P2P network or the Darknet is illegal [12]. Therefore, only CSEM file names obtained legally, i.e., provided by LEAs, could be used. Secondly, a file name is typically a short text, which leads to a sparse representation of the samples: the feature space is massive, while each instance is represented by only a few features. Finally, CSEM producers and consumers tend to invent a personalized file naming style, with their own vocabulary, abbreviations, and acronyms, to circumvent detection tools. For example, in a sample named "!!!!yoB0yXX", the exclamation marks refer to the age of a boy, and the letter O is replaced by the number zero. Hence, most likely, this file is related to the abuse of a four-year-old boy. It is worth mentioning that the last two challenges, i.e., the lack of context and the deliberate distortion of the text, make it more difficult to build or to use pre-trained language models, such as word2vec [13] and GloVe [14], and contextualized word representations, like Bidirectional Encoder Representations from Transformers (BERT) [15] and Embeddings from Language Models (ELMo) [16]. High accuracy may not be achievable for every file, as some of these names carry minimal information. Nevertheless, such a classifier would play a key role in filtering those files with a higher probability of being CSEM and in facilitating the analysts' work, which would otherwise be unfeasible.
Only a small body of research has focused on the problem of identifying CSEM via file names. The most recent works are those of Pereira et al. [17] and Al-Nabki et al. [18], who experimented with different deep learning and machine learning algorithms to build a supervised classifier for the files.
Unlike the common strategies that use file names only, this paper also incorporates the absolute path in parallel. The use of both pieces of information, i.e., the name and the path, was first presented by Pereira et al. [17], who trained a single classifier on the absolute file path, including the file name. In contrast, the File Classifier (FC) we propose in this paper uses a dedicated classifier for each component: a File Path Classifier (FPC) to classify the absolute paths and a File Name Classifier (FNC) for file names. Their outputs are fused into a single score. This design prevents the classification decision from being skewed toward the content of the absolute file path, which typically occupies most of the text.
The FNC design builds on our previous work [18], extending the file name representation and using a bigger dataset. Furthermore, this paper elaborates on the FPC and demonstrates how its output is integrated with that of the FNC.
The file name and the file path can complement each other when either of them carries a CSEM pattern. Nevertheless, this approach is not advantageous when neither source carries such a pattern, such as a file named only with numbers and located in the root folder. We propose two approaches to build the FC (see Fig. 1). The first one uses two standalone classification models, one for the FNC and another for the FPC. The outputs of these two classifiers are fused into a single output. The other approach employs only the FNC to classify both the file name and the path. It iterates over the absolute path along with the file name, and whenever the FNC detects a CSEM name within the path, it reports the file as CSEM.
The main contributions of this paper are summarized as follows.
• We propose a framework for classifying files as CSEM or safe material based on the fusion of the outputs of two supervised classifiers, which use file names and their absolute paths.
• We extend the text of the file names by appending two additional intermediate representations suited to the task of CSEM detection. The first one is a novel binary representation, which distinguishes character blocks from non-character ones. The second is an orthographic feature that captures the variation in the types of file name characters. To the best of our knowledge, the orthographic feature has previously been used for named entity recognition tasks [19,20], but not to encode file names for text classification.
• We build a dataset with 5.9M unique file path samples and 890K unique file name samples. To the best of our knowledge, this is the largest dataset used for classifying CSEM using file names and paths.
• We apply our framework to a real-case application: CSEM detection. We also integrate our framework into a practical forensic tool that could support LEAs worldwide in the task of CSEM detection. In this configuration, the FNC is used to iterate over the path's directories and the name of the file.
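To make the second contribution more concrete, the following minimal sketch shows one plausible way to derive such intermediate representations. The symbol choices are assumptions made for illustration only: the binary view maps alphabetic characters to 1 and everything else to 0, and the orthographic view follows a scheme common in named entity recognition (C for uppercase, c for lowercase, n for digits, p for any other character). The function names `binary_view`, `orthographic_view`, and `extend_file_name` are hypothetical.

```python
def binary_view(name: str) -> str:
    """Map alphabetic characters to '1' and all other characters to '0' (assumed encoding)."""
    return "".join("1" if ch.isalpha() else "0" for ch in name)


def orthographic_view(name: str) -> str:
    """Replace each character by a symbol for its type (assumed NER-style scheme)."""
    symbols = []
    for ch in name:
        if ch.isupper():
            symbols.append("C")  # uppercase letter
        elif ch.islower():
            symbols.append("c")  # lowercase letter
        elif ch.isdigit():
            symbols.append("n")  # digit
        else:
            symbols.append("p")  # punctuation or any other character
    return "".join(symbols)


def extend_file_name(name: str) -> str:
    """Append both views to the original text, separated by spaces."""
    return f"{name} {binary_view(name)} {orthographic_view(name)}"
```

Under these assumptions, two names that differ in their letters but share the same layout of letters, digits, and separators produce identical binary and orthographic strings, which is how such views can expose a naming pattern that survives character substitution.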
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 describes the proposed classification methodology. Next, Section 4 explains how the datasets of the FNC and the FPC are created and what their main features are. Then, in Section 5, we describe the experiments performed and discuss the results obtained. Finally, Section 6 presents the conclusions and points to our future research.

Related Work
The use of file name classification in recognizing CSEM has not received much attention despite its efficacy in identifying potential forensic evidence. To the best of our knowledge, only a few research papers have been published in recent years.
To begin with, Panchenko et al. [7] attempted to normalize file names using Short Message Service (SMS) normalization techniques proposed by Beaufort et al. [21]. With the normalized text, they trained a Support Vector Machine (SVM) classifier and obtained an accuracy of 96.97% on their dataset. Peersman et al. [22] proposed a framework called iCOP to detect CSEM on P2P networks. The first stage of their classification pipeline was a manually constructed dictionary-based filter holding CSEM keywords. They used character n-grams of size two to four to capture more features of the file name and a binary SVM as the classifier. Afterward, in their more recent work [8], Peersman et al. used a similar representation but benchmarked more classifiers, like SVM and Naive Bayes (NB). Due to the lack of a public dataset for this task, they also evaluated their proposal on a custom dataset, and they observed that the SVM classifier could identify CSEM file names with a recall rate of 0.43. Al-Nabki et al. [18] compared machine learning classifiers, such as SVM and Logistic Regression (LR), that use character n-grams with Term Frequency-Inverse Document Frequency (TF-IDF), against deep learning classifiers based on Convolutional Neural Networks (CNNs). Specifically, they adopted two CNN models developed by Zhang et al. [23] and Kim et al. [24]. The model of Zhang et al. was the best benchmarked CNN-based classifier and obtained an F1 score of 0.85, while the machine learning approach using the LR classifier obtained a slightly lower F1 score of 0.84. The major difference was their processing time on a CPU machine, where the latter was by far quicker than the former. Pereira et al. [17] compared several machine learning and deep learning models to classify files using the file names and paths. They conducted the experiments on a dataset of 1,010,000 file paths from 55,312 unique storage systems provided by Project VIC International. Similar to the conclusion of Al-Nabki et al. [18], they found that the CNN character-based model proposed by Zhang et al. [23] achieves the best recall rate of 0.94.
The problem of CSEM identification through file names can be framed within a wider research topic, short text classification [25,26,27,28,29,30], and in particular, news headlines classification and Twitter posts classification.
The news headlines classification task attempts to group news articles based on their titles, where a title is typically made up of a few words. Rana et al. [28] proposed a pipeline of three stages: data pre-processing, text representation, and classification. In the data pre-processing step, the text was tokenized into words, special characters were replaced by spaces, stop words were removed, and the text was stemmed. For the text representation, the authors used TF-IDF, Information Gain (IG) [27], and Boolean Weight (BW) [31]. Finally, in the classification stage, Rana et al. explored NB [32], SVM [33], K-Nearest Neighbor (KNN) [34], and Decision Trees (DT) [35]. However, the core difference between our problem and news headlines classification is that the latter has high-quality input text, where the punctuation marks are maintained correctly and there are no misspelled words.
Classifying Twitter posts also falls under the umbrella of short text classification, as the common length of a tweet is 33 characters, while the maximum number of characters is 280 [36]. Furthermore, the quality of the text can be low in comparison to the news headlines problem, and it might contain abbreviations to save space or misspelled words [36]. Imran et al. [37] pre-processed the tweets by removing hyperlinks, mentions, and stop words. Then, they used n-grams and IG for feature extraction and a Random Forest (RF) classifier [38]. Chen et al. [39] proposed a framework to identify cyberbullying on Twitter. For text representation, they compared pre-trained language models, like Word2Vec [13] and GloVe [14], with traditional text encoding techniques, such as TF-IDF, and they observed a decline in performance when embedding-based representations were used. For classification, they compared traditional machine learning classifiers, like LR and SVM, with deep learning classifiers, like Long Short-Term Memory (LSTM) [40] and CNN [23]. Carnevale et al. [41] proposed an algorithm to classify noisy and low-quality text generated from critical patients' posts on Twitter. The authors employed n-grams with TF-IDF for feature extraction and benchmarked two classifiers, SVM and NB, to compare their performance on the task.
Furthermore, the problem of file path classification can be treated as a branch of URL classification since both share similar characteristics in terms of structure and the use of concatenated words. This topic has been investigated widely [42,43,44,45,46,47]. Sahingoz et al. [46] used a URL classification approach to identify phishing websites through their addresses. They explored various features extracted manually from the URL and used them to benchmark several machine learning classifiers, such as DT, SVM, and RF. Trevisan et al. [45] examined the use of Generative Adversarial Networks (GANs) to classify four classes of URLs, given their ability to cope with the lack of training samples.

Methodology
This section introduces two approaches for designing the File Classifier (FC) to identify CSEM. The first one involves two standalone classifiers, one for the file name and another for the file path, whose outputs are fused into a single value that represents the prediction confidence. The second approach builds a single classifier for file names. Since the path consists of a sequence of folder names, each of which can be treated as a file name, a file name classifier can iterate over the sub-directory names from the root directory down to the file name. Finally, the prediction confidences of the sub-directories are fused. Both approaches follow a typical machine learning design that consists of three main stages [5]: a text pre-processor, a feature extractor, and a classifier. In the following, we elaborate on each approach in detail.

Two Standalone Classifiers Approach
In the following, we present two classifiers, a File Name Classifier (FNC) and a File Path Classifier (FPC). Each classifier has its own dataset for training and testing.

File Name Classifier
The FNC presented in this paper attempts to improve on the previous implementations explored in the literature by 1) enriching the file name representation and 2) training on a larger and more representative dataset (Section 4.1).
File Name Feature Extraction. Finding an adequate representation of the text is a crucial step in a classification pipeline. For this work, we used character n-grams to extract all the patterns of two to five consecutive characters of an input file name, which builds a set of tokens. Then, we apply the well-known TF-IDF technique [48], since it gives higher weights to grams that are frequent within a few file names and, at the same time, decreases the weight of grams that occur in many files. This way, it mitigates the issue of misspelled words or personalized naming styles in file names. Table 2 shows an example of the two to five grams of the file name "!!!!yoB0yXX". Furthermore, to discard noisy tokens, we set thresholds for the minimum and the maximum term frequency.
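The tokenization step can be illustrated with a minimal sketch that enumerates every run of two to five consecutive characters. The function name `char_ngrams` and the neutral sample name are assumptions introduced for this illustration; the TF-IDF weighting would then be applied to the resulting tokens.

```python
from collections import Counter
from typing import List


def char_ngrams(text: str, n_min: int = 2, n_max: int = 5) -> List[str]:
    """Return all character n-grams of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        # A string of length L contains L - n + 1 grams of size n (none if L < n).
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


# Neutral, made-up file name used only to show the token set.
tokens = char_ngrams("IMG_0042.jpg")
token_counts = Counter(tokens)  # raw term frequencies, the input to TF-IDF
```

Because every gram overlaps its neighbours, replacing a single character in a name changes only the grams that cover that position, which is why this representation tolerates obfuscated spellings better than word-level tokens.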
File Name Classification. After extracting the features, we use them to train the FNC. Based on previous research [18] and considering both classification performance and execution time, we use Logistic Regression (LR).

File Path Classifier
The FPC is a supervised binary classifier that decides whether a given file's absolute path is CSEM-related or not. The FPC consists of the following three components.
File Path Pre-processing. Only the file paths are pre-processed at this stage, since the FNC already handles the file names. Initially, the path is converted into a string by replacing the slash sign (/) with a space. Next, we replace special characters and digits with # and $, respectively. Finally, using regular expressions, we split the text at capital letters, if any exist. Table 3 illustrates the pre-processing procedure applied to two samples.
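The pre-processing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact implementation: a "special character" is taken to be anything that is neither a letter, a digit, nor whitespace (underscores included), and "splitting at capital letters" is taken to mean inserting a space between a lowercase letter and the uppercase letter that follows it. The function name `preprocess_path` is hypothetical.

```python
import re


def preprocess_path(path: str) -> str:
    """Apply the described path pre-processing steps in order."""
    # 1) Turn the path into a plain string: slashes become spaces.
    text = path.replace("/", " ")
    # 2) Special characters become '#'. This must run before the digit step,
    #    otherwise the '$' placeholders would themselves be rewritten to '#'.
    text = re.sub(r"[^\w\s]|_", "#", text)
    # 3) Digits become '$'.
    text = re.sub(r"\d", "$", text)
    # 4) Split concatenated words at lowercase-to-uppercase boundaries (assumption).
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Collapse repeated whitespace left over from leading or doubled slashes.
    return " ".join(text.split())


example = preprocess_path("/home/user/myHolidayPhotos/img_2019")
```

With these assumptions, `example` evaluates to `home user my Holiday Photos img#$$$$`: the digits and the underscore are abstracted into placeholders while the concatenated folder name is broken into separate words.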
File Path Feature Extraction. The problem of path classification is similar to file name classification. In both cases, we could not use pre-trained models because most of the text would be out of vocabulary and would not be represented properly. For this reason, we used the same feature extraction technique as for the FNC, i.e., character-level n-grams of two to five characters. We applied it along with the TF-IDF algorithm, as described in Section 3.1.1.
File Path Classification. After extracting the features from the file paths, we use them to train a binary supervised Logistic Regression classifier, which distinguishes CSEM paths from regular ones.

Fusing File Name and File Path Classifiers
This section presents how we aggregate the predictions of the two classifiers, the FNC and the FPC, into one prediction value. The desired fusion strategy must be sensitive to potential CSEM in either the file name or the file path. Hence, our fusion strategy returns the result of the classifier that has the highest CSEM confidence. For example, for a given sample x, the FNC predicts it is CSEM with 20% confidence and Non-CSEM with 80%, while the FPC predicts it is CSEM with 40% and Non-CSEM with 60%. In this case, the FPC confidence for the CSEM class is higher than the FNC's, and therefore the result of the FPC is the final output of the fusion. Formally, Eq. 1 describes this procedure:

$$FC(x) = \begin{cases} FNC(x), & \text{if } FNC(x)_{CSEM} \geq FPC(x)_{CSEM} \\ FPC(x), & \text{otherwise} \end{cases} \qquad (1)$$

where $FC(x)$ refers to the classification result of a sample $x$, and $FNC(x)_{CSEM}$ and $FPC(x)_{CSEM}$ refer to the confidence of the respective classifier regarding the CSEM class.
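The fusion rule can be expressed as a short sketch. The names `Prediction` and `fuse`, and the 0.5 decision threshold used to turn the selected confidence into a label, are assumptions introduced for illustration; the text only specifies that the classifier with the higher CSEM confidence determines the output.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    source: str             # which classifier determined the output
    csem_confidence: float  # that classifier's confidence for the CSEM class
    is_csem: bool           # final label


def fuse(fnc_csem: float, fpc_csem: float, threshold: float = 0.5) -> Prediction:
    """Return the result of the classifier with the higher CSEM confidence."""
    if fnc_csem >= fpc_csem:
        source, confidence = "FNC", fnc_csem
    else:
        source, confidence = "FPC", fpc_csem
    # Assumed decision rule: label as CSEM when the chosen confidence reaches the threshold.
    return Prediction(source, confidence, confidence >= threshold)


# The worked example from the text: FNC at 20% and FPC at 40% CSEM confidence.
example = fuse(fnc_csem=0.20, fpc_csem=0.40)
```

For the worked example, the FPC is selected and, under the assumed threshold, the sample is labelled Non-CSEM. In a binary setting a tie between the two confidences yields the same label whichever classifier is chosen, so the direction of the comparison at equality does not affect the outcome.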

Single Iterative Classifier Approach
Typically, the absolute path of a file is made up of a sequence of folder names. This approach treats each folder name as a standalone file name and uses the previously implemented FNC (presented in Section 3.1.1) to classify it. Therefore, if an input path has N components, including the file name, the FNC is called N times and classifies N entries. If any of these N entries is reported as CSEM, the complete path is considered CSEM. Otherwise, the path is considered Non-CSEM. Unlike a majority voting approach, this technique is highly sensitive to any suspicious sub-directory name in the input path. The prediction cost of this approach is proportional to the depth of the absolute path. Hence, for M samples with N components each, the complexity is O(N × M) calls to the FNC.
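A minimal sketch of this iterative scheme is given below. The trained FNC is represented by an arbitrary callable, and `keyword_stub` is a purely hypothetical stand-in used to make the example runnable; `path_components` and `classify_path_iteratively` are likewise illustrative names.

```python
from typing import Callable, List


def path_components(path: str) -> List[str]:
    """Split an absolute path into its folder names and final file name."""
    return [part for part in path.split("/") if part]


def classify_path_iteratively(path: str, is_flagged: Callable[[str], bool]) -> bool:
    """Flag the whole path as soon as any single component is flagged."""
    # any() stops at the first flagged component, so N calls is the worst case.
    return any(is_flagged(part) for part in path_components(path))


def keyword_stub(name: str) -> bool:
    """Hypothetical stand-in for the trained FNC; flags a placeholder token."""
    return "flagged" in name.lower()


result = classify_path_iteratively("/data/flagged_folder/file001.bin", keyword_stub)
```

Because a single positive component decides the outcome, this design favours recall: one false positive among the N components is enough to flag the whole path, which is the sensitivity the text contrasts with majority voting.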

Dataset Construction
Another motivation for building separate datasets for file names and file paths is that a unique path could contain hundreds of files, resulting in hundreds of file name samples. Table 4 shows five file names that refer to two unique paths.
Furthermore, we noticed that the paths are characterized by a lack of explicit CSEM-related words. Instead, considering the words of all the sub-directories of a path at once may reveal the suspicious content of that path. To illustrate this, Table 4 gives two unique path samples. In the first example, the word "Sarah" on its own or "Silver Starlets" is not CSEM-related, but their co-occurrence with other directories named "Starlets", "skirt", and the number five (the last directory of the first example) could indicate a sequence of photos of a five-year-old girl wearing a pink skirt.
Table 4: Examples of file paths dataset along with their corresponding file names

File Name Dataset
For the negative class, i.e., the safe files, we used a dataset published by the National Software Reference Library (NSRL) that contains more than 32 million file names. We selected an initial subset of 800,000 Non-CSEM examples, resulting in 537,807 after applying the pre-processing step. The CSEM examples were collected thanks to the collaboration between the Spanish National Cybersecurity Institute (INCIBE) and the Spanish LEAs. The latter provided us with a list obtained from dumps of hard disks seized from criminals' computers. The list had 90,000 CSEM samples. However, after pre-processing them, the number decreased to 37,648 unique instances.

File Path Dataset
Similar to the file name classifier, the file path classifier has two classes, CSEM and Non-CSEM. For the Non-CSEM class, we gathered 3,031,802 unique paths from dumps of eight computers that host Non-CSEM files, and for the CSEM class, 2,864,105 paths provided to us by the Spanish LEA. After pre-processing these paths, we ended up with 2,065,590 unique instances: 924,445 for the CSEM class and 1,141,145 for the Non-CSEM class.

Experimental Setting
The experiments were carried out on a PC with an Intel(R) Core(TM) i7 processor and 32 GB of RAM under Windows 10. We used Python 3 with Scikit-Learn to implement the classifiers.
Regarding the File Name Classifier's configuration, we used character n-grams, extracting patterns from two to five grams [18]. Also, we set the thresholds for the maximum and the minimum gram proportion to 0.999 and 0.0005, respectively. For the LR classifier, we empirically set the parameter C, which is the inverse of the regularization strength, to 100, and we activated the class weight parameter to account for the class imbalance during training. The rest of the parameters were left at their Scikit-Learn default values. The File Path Classifier used the same configuration as the File Name Classifier.
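A configuration along these lines can be sketched with Scikit-Learn as follows. This is an illustration under assumptions rather than the exact experimental code: the gram proportion thresholds are mapped to the `min_df` and `max_df` parameters of `TfidfVectorizer`, "activating the class weight parameter" is read as `class_weight="balanced"`, and the helper name `build_classifier` and the toy training names are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_classifier():
    """Character n-gram TF-IDF features followed by Logistic Regression."""
    vectorizer = TfidfVectorizer(
        analyzer="char",     # operate on characters rather than words
        ngram_range=(2, 5),  # two to five grams
        min_df=0.0005,       # assumed mapping of the minimum gram proportion
        max_df=0.999,        # assumed mapping of the maximum gram proportion
    )
    classifier = LogisticRegression(C=100, class_weight="balanced")
    return make_pipeline(vectorizer, classifier)


# Neutral toy data: label 1 is a placeholder positive class, not real samples.
toy_names = ["report_2020.pdf", "invoice_march.xlsx", "holiday_photo.jpg",
             "zz_marker_01.avi", "zz_marker_02.avi", "zz_marker_clip.mp4"]
toy_labels = [0, 0, 0, 1, 1, 1]

model = build_classifier().fit(toy_names, toy_labels)
proba = model.predict_proba(["zz_marker_03.avi", "budget_2021.pdf"])
```

Each row of `proba` holds the confidences for the two classes, which is the quantity the fusion step compares across the FNC and the FPC.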
To estimate the models' performance, we report the performance of each classifier on a test set. For the File Name Classifier, the dataset has 890,000 samples before pre-processing, and we split it 80/20 into training and test sets, respectively. A detailed description of the dataset sizes is given in Table 5. Unlike the file name samples, the file paths could not be split into training and test sets by a fixed percentage because the samples of these sets must not intersect. Hence, the machines used to obtain the file path dumps for each set are distinct. Table 6 gives detailed information about the class sizes of this dataset. Finally, to test the joint performance of the file name and file path models, we created a binary dataset of 50,000 samples, equally distributed between the classes. We randomly sampled 50,000 file paths and 50,000 file names from the test sets of the file path and file name datasets, respectively. Then, we created a balanced synthesized test set by fusing these two sets. A sample is considered CSEM if its name or path was sampled from a CSEM instance; otherwise, it is tagged as Non-CSEM.
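The labelling rule of the synthesized test set can be sketched as follows. How names and paths are paired and how the class balance is enforced are not detailed in the text, so the random pairing below, the helper name `fuse_test_sets`, and the toy entries are assumptions; only the rule that a pair is positive when either part is positive comes from the description above.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, bool]  # (text, comes_from_positive_class)


def fuse_test_sets(names: List[Sample], paths: List[Sample],
                   seed: int = 0) -> List[Tuple[str, str, bool]]:
    """Pair each path with a randomly chosen name and derive the joint label."""
    rng = random.Random(seed)
    shuffled_names = names[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled_names)
    fused = []
    for (path, path_positive), (name, name_positive) in zip(paths, shuffled_names):
        # A fused sample is positive if its name OR its path is positive.
        fused.append((path, name, path_positive or name_positive))
    return fused


# Neutral toy entries; True marks a placeholder positive class.
toy_fused = fuse_test_sets(
    names=[("n1.bin", False), ("n2.bin", True)],
    paths=[("/p1", False), ("/p2", False)],
)
```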

Evaluation Metric
The principal objective of this work is to assist LEAs in detecting CSEM through file names, avoiding the exposure of an agent to CSEM. Therefore, a low number of false negatives, i.e., CSEM files identified as Non-CSEM, is more desirable than a low number of false positives, i.e., Non-CSEM files wrongly categorized as CSEM. Hence, we prioritize a high recall for the CSEM class over the Non-CSEM class.
The recall of a class is calculated as the number of samples correctly classified as that class (the True Positives, TP) over the total number of samples of that class (the True Positives, TP, and the False Negatives, FN). Equation (2) shows how recall is estimated for a given class.

$$Recall = \frac{TP}{TP + FN} \qquad (2)$$
Nevertheless, the precision of a classifier is also a crucial factor in measuring its performance, as it shows the proportion of predicted samples that are correctly identified. The precision of a class is calculated as the ratio of correctly classified file names of that class (the True Positives, TP) to the total number of samples predicted as that class (the True Positives, TP, and the False Positives, FP), as given in Equation (3).

$$Precision = \frac{TP}{TP + FP} \qquad (3)$$
Finally, the F1 score of a class summarizes the two aforementioned metrics, as it is the harmonic mean of precision and recall, calculated according to Equation (4).
Additionally, it has been shown that the accuracy metric is not reliable when the dataset is imbalanced [49], as in our case, where the majority of the samples are Non-CSEM file names. An alternative is to use the average class recall rather than the overall dataset-level accuracy.
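The metrics described in this section can be computed directly from the four confusion matrix counts. The sketch below uses made-up counts for an imbalanced binary problem (950 negative and 50 positive samples) to show why accuracy can look strong while the average class recall does not; the function names are illustrative.

```python
def recall(tp: int, fn: int) -> float:
    """Share of a class's samples that were correctly identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of the samples predicted as a class that truly belong to it."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * (r * p) / (r + p) if (r + p) else 0.0


def average_class_recall(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the recalls of the positive and the negative class."""
    return (recall(tp, fn) + recall(tn, fp)) / 2


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    return (tp + tn) / (tp + fn + tn + fp)


# Made-up counts: 50 positives (20 found, 30 missed), 950 negatives (5 false alarms).
tp, fn, tn, fp = 20, 30, 945, 5
acc = accuracy(tp, fn, tn, fp)              # 0.965, dominated by the majority class
acr = average_class_recall(tp, fn, tn, fp)  # about 0.697, exposes the missed positives
f1_positive = f1(precision(tp, fp), recall(tp, fn))
```

In this toy case, the classifier misses 60% of the positive class yet still reaches 96.5% accuracy, whereas the average class recall of roughly 0.70 reflects the weak performance on the minority class.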

Empirical Results
In this section, we evaluate both classifiers and the proposed fusion methods, as described in Section 3.
Table 7 analyzes the impact of the features used to boost the file name representation. Our results show that joining all the representation techniques yields the best classification performance for the FNC, with an average class recall of 0.98 and an F1 score of 0.96. Afterward, we evaluated the FPC on its test set, as shown in Table 8. The FPC obtained 0.97 for both the average class recall and the F1 score; compared with the FNC, which scored 0.98 and 0.96, respectively, its F1 score is slightly higher and its average class recall slightly lower. In addition to reporting the performance of each classifier individually, we analyze two techniques of fusing them, as described earlier. Table 9 shows that using two standalone classifiers, one for the file path and one for the file name, surpasses the single iterative classifier approach. The two-classifier architecture achieves an average class recall of 0.98, which is higher than the 0.74 scored by the approach that iteratively uses the FNC.

Conclusions and Future Work
In this paper, we presented a supervised machine learning approach to distinguish files that may contain Child Sexual Exploitation Material (CSEM) from regular files (Non-CSEM). Given that this work aims at fast CSEM prediction, only file names and paths are used. We proposed two solutions: 1) building two standalone classifiers, a File Name Classifier (FNC) and a File Path Classifier (FPC), and then fusing their outputs into a single decision, and 2) dividing the file path into a list of folder names and using the FNC to classify each name in the path. Our results support the superiority of the former approach, as it obtained an average class recall of 0.98, while the latter scored an average class recall of 0.74.
For the FNC, we pre-processed the text and enriched it with two features, binary and orthographic, which increased the recall rate of the CSEM class from 0.89 to 0.93 and yielded an average class recall of 0.98. The FPC uses an architecture similar to the FNC's but differs in the pre-processing stage, and it achieved an average class recall of 0.97.
The empirical evaluation was conducted on a dataset of file names and file paths. As future work, we plan to enlarge the dataset by obtaining samples from more seized computers, exposing the model to a wider range of CSEM file name patterns. Furthermore, once the dataset is extended, we aim to build a character-based language model [50] for CSEM files. The assessment of transformer-based models, such as BERT [15] and XLNet [51], for text classification is part of our immediate future research, as they have shown promising results on various NLP tasks.

Figure 1: Two classification approaches to classify files. The structure shown on the right of the figure demonstrates how the two standalone classifiers approach works. In this configuration, the FNC and the FPC classify the file name and the file path, respectively. Then, their outputs are fused to determine the file category. The structure shown on the left illustrates how the single iterative classifier approach works. In this configuration, the FNC is used to iterate over the path's directories and the name of the file.

$$F1 = 2 \cdot \frac{Recall \cdot Precision}{Recall + Precision} \qquad (4)$$

Table 1: Example of pre-processing and tokenizing a file name with two, three, four, and five grams

Table 2: Example of pre-processing and tokenizing a file name with two to five grams

Table 3: Stages of the path pre-processing procedure

Table 5: Description of the dataset used to train the FNC.

Table 6: Description of the dataset used to train the FPC.

Table 7: The impact of file name representation on the performance of the FNC.

Table 8: The performance of the FNC and the FPC classifiers on their test sets.

Table 9: A comparison between two techniques of fusing the FPC and FNC models. The values in bold refer to the best prediction F1 score.