A Python library to check the level of anonymity of a dataset

Openly sharing data with sensitive attributes and privacy restrictions is a challenging task. In this document we present the implementation of pyCANON, a Python library and command line interface (CLI) to check and assess the level of anonymity of a dataset through some of the most common anonymization techniques: k-anonymity, (α,k)-anonymity, ℓ-diversity, entropy ℓ-diversity, recursive (c,ℓ)-diversity, t-closeness, basic β-likeness, enhanced β-likeness and δ-disclosure privacy. For the case of more than one sensitive attribute, two approaches are proposed for evaluating these techniques. The main strength of this library is that it produces a full report of the parameters fulfilled for each of the techniques mentioned above, requiring only the set of quasi-identifiers and the set of sensitive attributes. The methods implemented are presented together with the attacks they prevent, a description of the library, examples of the different functions' usage, and the impact and possible applications that can be developed. Finally, some aspects to be incorporated in future updates are proposed.


Introduction
The unstoppable advances in data analysis and processing techniques for knowledge extraction and decision making, whether concerning Big Data or small data, motivate the idea of publishing datasets in a way that is accessible to the scientific community and society in general. Likewise, collaboration between different institutions, research centers, or even companies requires that the necessary data can be shared among them in a secure way. Furthermore, the need to publish open data, for example to keep society informed about the processes of public organizations, intensifies the need to develop tools that allow the publication of such data with privacy guarantees.
When discussing the privacy guarantees that a dataset must fulfill, we are referring to preventing an attacker from extracting sensitive information about a specific individual. There is a strong requirement to protect certain data, such as health records (hospital visits, chronic illnesses), banking information or even police reports. For example, in the case of mental health records, the disclosure of such data about an individual could lead to discrimination and social exclusion. A person's medical data can also be used in an unethical manner to increase the cost of health insurance, or even lead to social exclusion if it is revealed that a certain individual suffers from a specific disease. Similarly, information on political beliefs or even on an individual's monthly income could lead to exclusion and prejudice. Every day we all produce a digital footprint that reveals our tastes, hobbies or even fears. In many cases this is reflected in personalized advertising, but to what extent is this good for the user? If someone always consults news of a certain political leaning, some algorithms will hide news presenting contrary opinions, resulting in an intellectual isolation of the users and a lack of critical thinking in society.
A classic example of data that was openly published and eventually revealed a security breach is that of the US Census. Specifically, a study conducted on 1990 U.S. Census data revealed that 87% of the time, three pieces of information (zip code, gender and date of birth) were sufficient to identify someone in a database (see 1). This shows that it is not enough to remove identifiers from a database to prevent an individual from being identified, since he/she can be identified through other types of attributes which a priori seem harmless. Let us briefly define four key concepts about the attributes or columns of a database. Firstly, the identifiers are those variables that allow an individual to be identified directly (e.g. name, ID number, email). Secondly, the quasi-identifiers (QI) are variables that, although a priori they seem not to reveal relevant information, combined with each other can make the identification of an individual possible, as occurs in the case of the U.S. Census; moreover, they are accessible to the attacker (e.g. gender, age, city, etc.). Thirdly, the sensitive attributes (SA) are the columns of the database that contain sensitive information that must not be disclosed and must not be able to be extracted. Finally, the insensitive attributes are the remaining columns, which pose no privacy risk and require no special treatment.
To this end, several software projects and libraries have emerged. The main tool to highlight when talking about data anonymization is the ARX Software 2: ARX is a comprehensive open source software for the anonymization of sensitive data that supports a wide variety of privacy, risk and quality models, together with methods for data transformation and for analyzing the usefulness of the output data. PRIVAaaS 3 is another tool focused on providing policy-based anonymization in distributed data processing environments, aimed at reducing data leakage in Big Data processing. ARGUS (including the µ-ARGUS and τ-ARGUS packages) is a software library for Statistical Disclosure Control delivered by the CASC project, focused on microdata and tabular data.
This paper presents pyCANON, a Python library and CLI that makes it possible to check whether certain anonymity conditions are verified for a dataset, given a set of quasi-identifiers and a set of sensitive attributes. The main idea behind it is to provide researchers, and in general anyone who wants to publish a dataset in open access or to share it with others, with prior knowledge of the level of anonymization of their data. This will provide insights into the possible risks to which this information would be exposed, allowing its resistance to different attacks to be verified. An important consideration is that the data have to be provided in tabular format; that is, they cannot be images, videos or any other kind of unstructured data.
In the following sections, the theoretical bases of the implemented methods are presented, as well as a brief explanation of the main functionalities of the software, examples of use, and the impact and different applications associated with it.

Results
As explained in the previous section, the purpose of the library presented in this article is, essentially, to check the anonymity level of a dataset. For this purpose, we propose the use of nine different anonymization techniques: k-anonymity, (α,k)-anonymity, ℓ-diversity, entropy ℓ-diversity, recursive (c,ℓ)-diversity, basic β-likeness, enhanced β-likeness, t-closeness and δ-disclosure privacy. Specifically, given a set of data, a list of quasi-identifiers and a list of sensitive attributes, it will be possible to check for which parameters each of the aforementioned techniques is verified, in order to know the degree of anonymity of the data, and thus the possible risks they may be subject to.
Before presenting the definitions and the different aspects taken into account for the implementation of the different techniques, let us introduce the concept of equivalence class (EC). An equivalence class is a partition of a database in which all the quasi-identifiers have the same value. That is, users who are in the same equivalence class are all indistinguishable with respect to the quasi-identifiers. The different techniques under study are presented below:
• k-anonymity. A database verifies k-anonymity if each equivalence class of the database has at least k rows. In other words, for each row of the database, there are at least k − 1 indistinguishable rows with respect to the quasi-identifiers. Note that k ≥ 1 is always verified.
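As a minimal illustration (a toy re-implementation in standard Python on made-up records, not pyCANON's actual API), the value of k can be computed by grouping rows by their quasi-identifier tuple and taking the size of the smallest equivalence class:

```python
from collections import Counter

# Toy records; 'age' and 'zip' act as quasi-identifiers (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "disease": "flu"},
    {"age": 30, "zip": "28001", "disease": "cold"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "asthma"},
]

def k_anonymity(records, qi):
    # k is the size of the smallest equivalence class over the QI values.
    ec_sizes = Counter(tuple(r[q] for q in qi) for r in records)
    return min(ec_sizes.values())

print(k_anonymity(rows, ["age", "zip"]))  # smallest EC has 2 rows -> 2
```

Here the two equivalence classes have 2 and 3 rows, so the data verify k-anonymity for k = 2 but not for k = 3.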
• (α,k)-anonymity. Given a single sensitive attribute S, it is checked whether the database is k-anonymous and whether the frequency of each possible value of S is at most α in each equivalence class.
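Continuing the toy sketch above (illustrative only, not the library's API), (α,k)-anonymity additionally bounds the within-class frequency of each sensitive value:

```python
from collections import Counter, defaultdict

# Toy records; 'age' and 'zip' act as quasi-identifiers (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "disease": "flu"},
    {"age": 30, "zip": "28001", "disease": "cold"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "asthma"},
]

def alpha_k_anonymity(records, qi, sa):
    # Returns (alpha, k): k is the smallest EC size, alpha the largest
    # frequency of any single value of sa within an EC.
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qi)].append(r[sa])
    k = min(len(v) for v in groups.values())
    alpha = max(c / len(v) for v in groups.values()
                for c in Counter(v).values())
    return alpha, k

alpha, k = alpha_k_anonymity(rows, ["age", "zip"], "disease")
```

On these records the second EC contains "flu" twice out of three rows, so the smallest α that holds is 2/3 (with k = 2).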
• ℓ-diversity. In the case of a single sensitive attribute S, it is satisfied if for each equivalence class there are at least ℓ distinct values for S. Note that ℓ ≥ 1 is always verified.
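The same toy setup gives a direct sketch of ℓ-diversity (again an illustrative re-implementation, not pyCANON's API): ℓ is the smallest number of distinct sensitive values found in any equivalence class.

```python
# Toy records; 'age' and 'zip' act as quasi-identifiers (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "disease": "flu"},
    {"age": 30, "zip": "28001", "disease": "cold"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "asthma"},
]

def l_diversity(records, qi, sa):
    # l is the smallest number of distinct sensitive values in any EC.
    groups = {}
    for r in records:
        groups.setdefault(tuple(r[q] for q in qi), set()).add(r[sa])
    return min(len(values) for values in groups.values())

print(l_diversity(rows, ["age", "zip"], "disease"))  # both ECs hold 2 distinct values -> 2
```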
• Entropy ℓ-diversity. A database with a single sensitive attribute S verifies this condition if H(EC) > log(ℓ) for every equivalence class EC of the database. Here H(EC) is the entropy of the equivalence class EC, defined as H(EC) = −Σ_{s∈D} p(EC, s) log(p(EC, s)), with D the domain of S and p(EC, s) the fraction of records in EC that have s as sensitive attribute.
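A toy sketch of the entropy computation (illustrative only; pyCANON's exact convention for extracting ℓ from the entropies may differ):

```python
import math
from collections import Counter, defaultdict

# Toy records; 'age' and 'zip' act as quasi-identifiers (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "disease": "flu"},
    {"age": 30, "zip": "28001", "disease": "cold"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "asthma"},
]

def entropy_l(records, qi, sa):
    # Shannon entropy of the sensitive values in each EC; return the
    # largest integer l whose log does not exceed the minimum EC entropy.
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qi)].append(r[sa])
    min_h = min(
        -sum((c / len(v)) * math.log(c / len(v)) for c in Counter(v).values())
        for v in groups.values()
    )
    return int(math.exp(min_h))
```

For these records the second EC has distribution (2/3, 1/3), whose entropy is below log(2), so entropy ℓ-diversity only holds for ℓ = 1.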
• Recursive (c,ℓ)-diversity. The main potential of this technique is that if a value of the sensitive attribute S is removed from an equivalence class which verifies (c,ℓ)-diversity, then (c,ℓ−1)-diversity is preserved. For the implementation of this technique, 4 has been used as reference (in order to get the formal definition of the concept). Specifically, suppose there are n different values for a sensitive attribute S in an equivalence class EC. Let r_i (i ∈ {1, …, n}) be the number of times that the i-th most frequent value of S appears in EC. Then, EC verifies recursive (c,ℓ)-diversity for S if r_1 < c(r_ℓ + r_{ℓ+1} + … + r_n).
In view of the previous inequality, in our implementation of this technique the value of c will not be calculated if a value of ℓ = 1 is obtained.
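The inequality above can be checked with a few lines of standard Python (a toy sketch on illustrative records, not the library's API):

```python
from collections import Counter, defaultdict

# Toy records; 'age' and 'zip' act as quasi-identifiers (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "disease": "flu"},
    {"age": 30, "zip": "28001", "disease": "cold"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "asthma"},
]

def is_recursive_cl_diverse(records, qi, sa, c, l):
    # In every EC, with sensitive-value counts sorted decreasingly as
    # r_1 >= ... >= r_n, require r_1 < c * (r_l + r_{l+1} + ... + r_n).
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qi)].append(r[sa])
    for v in groups.values():
        counts = sorted(Counter(v).values(), reverse=True)
        if counts[0] >= c * sum(counts[l - 1:]):
            return False
    return True

print(is_recursive_cl_diverse(rows, ["age", "zip"], "disease", 3, 2))  # True
print(is_recursive_cl_diverse(rows, ["age", "zip"], "disease", 2, 2))  # False
```

The second EC has counts (2, 1), so for ℓ = 2 the condition 2 < c·1 only holds once c exceeds 2.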
• Basic β-likeness and enhanced β-likeness. These two techniques have been implemented following Definitions 2 and 3 of 5, and can be used to control the distance between the distribution of a sensitive attribute in an equivalence class and in the entire database. In particular, let P = {p_1, …, p_n} be the distribution of a sensitive attribute S in the whole database and Q = {q_1, …, q_n} that of an equivalence class EC. Basic β-likeness requires max{D(p_i, q_i) : p_i ∈ P} ≤ β, while enhanced β-likeness requires D(p_i, q_i) ≤ min{β, −log(p_i)} for every p_i ∈ P. In both cases β > 0. Note that enhanced β-likeness provides more robust privacy than basic β-likeness.
In our implementation the relative distance function is used to calculate the distance between the distributions, that is: D(p_i, q_i) = (q_i − p_i)/p_i.
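With the relative distance, the smallest β satisfying basic β-likeness can be sketched as follows (a toy re-implementation on illustrative records, not pyCANON's API):

```python
from collections import Counter, defaultdict

# Toy records; 'age' and 'zip' act as quasi-identifiers (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "disease": "flu"},
    {"age": 30, "zip": "28001", "disease": "cold"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "asthma"},
]

def basic_beta_likeness(records, qi, sa):
    # Smallest beta for basic beta-likeness under the relative distance
    # D(p, q) = (q - p) / p, maximised over ECs and sensitive values.
    n = len(records)
    p_db = {s: c / n for s, c in Counter(r[sa] for r in records).items()}
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qi)].append(r[sa])
    beta = 0.0
    for v in groups.values():
        for s, c in Counter(v).items():
            beta = max(beta, (c / len(v) - p_db[s]) / p_db[s])
    return beta

beta = basic_beta_likeness(rows, ["age", "zip"], "disease")
```

Here "cold" has global frequency 0.2 but frequency 0.5 in its EC, giving a relative distance of (0.5 − 0.2)/0.2 = 1.5, the resulting β.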
• t-closeness. The goal is again similar to that of the two previous techniques. A database with one sensitive attribute S verifies t-closeness if all its equivalence classes verify it. An equivalence class verifies t-closeness if the distribution of the values of S in it is at distance no greater than t from the distribution of the sensitive attribute in the whole database. In order to measure the distance between the distributions, following 6, the Earth Mover's distance (EMD) using the ordered distance is applied for numerical sensitive attributes. For categorical attributes, the equal distance is used.
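For a numerical sensitive attribute, the EMD with the ordered distance reduces to averaging absolute cumulative differences between the two distributions. A toy sketch (illustrative records and helper, not the library's API; it assumes at least two distinct sensitive values):

```python
from collections import Counter, defaultdict

# Toy records with a numerical sensitive attribute (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "salary": 1},
    {"age": 30, "zip": "28001", "salary": 2},
    {"age": 45, "zip": "28002", "salary": 2},
    {"age": 45, "zip": "28002", "salary": 3},
]

def t_closeness_numerical(records, qi, sa):
    # Largest ordered-distance EMD between any EC and the whole table;
    # t must be strictly greater than the returned value.
    n = len(records)
    values = sorted({r[sa] for r in records})
    p_db = Counter(r[sa] for r in records)
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qi)].append(r[sa])
    t = 0.0
    for v in groups.values():
        q_ec = Counter(v)
        acc = emd = 0.0
        for s in values[:-1]:
            acc += p_db[s] / n - q_ec[s] / len(v)  # cumulative difference
            emd += abs(acc)
        t = max(t, emd / (len(values) - 1))
    return t

print(t_closeness_numerical(rows, ["age", "zip"], "salary"))  # 0.25
```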
• δ-disclosure privacy. Given a database with only one sensitive attribute S, let p(EC, s) be the fraction of records with s as sensitive attribute in the equivalence class EC, and p(DB, s) the same fraction for the whole database (DB). Then, δ-disclosure privacy is verified iff |log(p(EC, s)/p(DB, s))| < δ for every s ∈ D (with D the domain of S) and every equivalence class EC 7.
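A toy sketch of the bound in this condition (illustrative re-implementation, not pyCANON's API; sensitive values absent from an EC are skipped, since their log ratio diverges):

```python
import math
from collections import Counter, defaultdict

# Toy records; 'age' and 'zip' act as quasi-identifiers (illustrative data).
rows = [
    {"age": 30, "zip": "28001", "disease": "flu"},
    {"age": 30, "zip": "28001", "disease": "cold"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "flu"},
    {"age": 45, "zip": "28002", "disease": "asthma"},
]

def delta_disclosure(records, qi, sa):
    # Largest |log(p(EC, s) / p(DB, s))| over ECs and the sensitive values
    # present in them; delta must be strictly greater than this value.
    n = len(records)
    p_db = {s: c / n for s, c in Counter(r[sa] for r in records).items()}
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qi)].append(r[sa])
    return max(
        abs(math.log((c / len(v)) / p_db[s]))
        for v in groups.values()
        for s, c in Counter(v).items()
    )

delta = delta_disclosure(rows, ["age", "zip"], "disease")
```

On these records the extreme case is again "cold" (frequency 0.5 in its EC versus 0.2 globally), so the bound is log(2.5).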
The motivation for including the nine techniques outlined above, and not a smaller number of them or just the most classic ones (e.g. k-anonymity or ℓ-diversity), is that they are not all useful against the same types of attacks. That is, there are techniques that are very useful against certain attacks, but cannot prevent others. In particular, Table 1 briefly describes some of the most common attacks that databases can suffer, namely: linkage, re-identification, homogeneity, background knowledge, skewness, similarity and inference attacks.

Attack — Description
Linkage: consists of combining at least two anonymized databases in order to reveal the identity of some individuals present in both.
Re-identification: occurs when the anonymization process is reversed.
Homogeneity: can occur when all the values of a sensitive attribute in an equivalence class are identical.
Background knowledge: the adversary has some foreknowledge about the target of the attack (e.g. knows some auxiliary information about an individual in the database).
Skewness: can be carried out when a value that is infrequent for a sensitive attribute in the whole database is extremely frequent in an equivalence class.
Similarity: may occur when the values of a sensitive attribute in an equivalence class are semantically similar (although different).
Inference: consists of using data mining techniques in order to extract information from the data.

Table 1. Common attacks on databases and their description.
In addition, Table 2 shows the most convenient techniques (although not the only ones) that can be applied to prevent each of the previously mentioned attacks. It is important to take into account that the values of t and δ for t-closeness and δ-disclosure privacy respectively must be strictly greater than the ones obtained using pyCANON (see the definitions of those techniques).

On the other hand, it should be noted that although the anonymization techniques have been presented for the case where the database contains a single sensitive attribute, they can nevertheless be applied in the case of multiple sensitive attributes. The latter may be quite common, since in many use cases there are several attributes deemed to be sensitive information. Specifically, this library implements two approaches that can be followed in this case:
1. In the simplest case, generalization is applied to the case of multiple sensitive attributes. That is, for each sensitive attribute (SA) each of the properties is checked, and the parameter that is satisfied for all of them is kept (e.g. for ℓ-diversity the smallest value of ℓ once it is computed for each SA will be kept, while the value of α for (α,k)-anonymity will be the largest of those obtained for each SA). In the following we will refer to this approach as the generalization approach.
2. In the second approach, the set of quasi-identifiers is updated according to the sensitive attribute to be analyzed, proceeding then as in the previous case. That is, let Q be the initial set of quasi-identifiers, and S the set of sensitive attributes. For each sensitive attribute S_i ∈ S, the set of quasi-identifiers considered (and with it the different equivalence classes) changes, becoming Q ∪ (S \ {S_i}). In the following we will refer to this approach as that of the quasi-identifiers update. The idea of introducing this second approach is that in certain cases an attacker could know some of the sensitive attributes, which would then act as quasi-identifiers allowing information about the rest of the sensitive information to be inferred.
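The quasi-identifier update of the second approach can be sketched as follows (an illustrative helper with hypothetical column names, not part of pyCANON's API):

```python
def updated_qi_sets(qi, sensitive):
    # For each sensitive attribute S_i, the remaining sensitive attributes
    # are treated as quasi-identifiers: QI_i = QI union (SA \ {S_i}).
    return {s: qi + [x for x in sensitive if x != s] for s in sensitive}

sets = updated_qi_sets(["age", "zip"], ["disease", "salary"])
print(sets["disease"])  # ['age', 'zip', 'salary']
print(sets["salary"])   # ['age', 'zip', 'disease']
```

Each technique is then evaluated once per sensitive attribute, with its own (larger) quasi-identifier set and hence its own equivalence classes.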
As will be explained in the following, pyCANON implements the two approaches exposed above, and it is up to the user to select which one to use. One of the main points to take into account is that the second approach, in which the set of quasi-identifiers is updated according to the sensitive attribute to be analyzed, is more computationally expensive, since for each sensitive attribute the equivalence classes must be recalculated (because the set of quasi-identifiers varies, in addition to being larger). The different functions available in this framework are shown in the following section, together with some usage examples.

Discussion
As already stated, the purpose of our tool is to assess the level of anonymity of a dataset with regard to the most common techniques, being complementary to other anonymization tools like ARX. In this section, a battery of applications that validate the impact and usefulness of this library for different purposes is discussed. In particular, different examples are presented using openly available data, so that the reproducibility of the results is ensured. However, note that, as will be explained later, many of these data have been anonymized using the open source software ARX 2, establishing different levels of hierarchies for the generalization of the quasi-identifiers, which could cause the results to vary according to the different hierarchies introduced.
Suppose we have a dataset and a list of quasi-identifiers for which we want the data to verify k-anonymity with, for instance, k = 5. To do this, we can use software like ARX, selecting as quasi-identifiers the columns of the dataset that interest us. Suppose that, by mistake, one of the quasi-identifiers is introduced as an insensitive attribute in the anonymization process (for example, in ARX there is a drop-down to distinguish the type of attribute, so it is not hard to imagine such a human error). Once anonymized using this software, we obtain a dataset verifying k-anonymity with k = 5 for the columns included as quasi-identifiers. Then, pyCANON can be used to check if the new dataset truly verifies k-anonymity with k = 5 for the initial list of quasi-identifiers. Suppose that this check yields a value of k = 1, i.e., there is at least one combination of quasi-identifier values shared by a single individual (an equivalence class with only one row). This allows us to quickly detect that there has been a failure in the anonymization process (in this case, at least one of the columns that should be a QI was introduced by mistake in ARX as an insensitive attribute). In this way, by checking the anonymity level with pyCANON we have managed to catch a mistake that can be very common, but that would allow an attacker to extract unwanted information from the dataset.
Another example of using the library will now be presented with the stroke dataset 8, with the following set of quasi-identifiers (QI) and sensitive attributes (SA): QI = ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'] and SA = ['stroke']. Again using ARX, k-anonymity is applied for different values of k, namely k = 2, 5, 10, 20, 25. However, setting for example k = 20, we can check with pyCANON that the actual k value obtained for the resulting dataset is k = 22. This does not mean that the anonymization done with ARX is wrong, because if k-anonymity is verified for k = 22, it is evidently verified for k = 20. In fact the software itself warns us of this if we inspect the data, indicating that the size of the smallest equivalence class is 22. This is probably because, with the generalization hierarchies allowed, the smallest achievable value of k satisfying k ≥ 20 is 22. Similarly, if in this same example we set k = 25 in ARX, then we can check with pyCANON that the maximum value for k is 27.

Another practical example is the following: suppose we have a dataset to which k-anonymity has been applied for a certain value of k, again, for example, using ARX. We may be interested, without applying any further technique, in checking some of the other previously exposed anonymization techniques, and comparing how these scale as a function of the value of k. For example, the drugs dataset 9 with QI = ['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'] and SA = ['Drug'] is anonymized using ARX considering only k-anonymity for different values of k, namely k = 2, 5, 10, 15, 20. Table 3 shows how the values of ℓ for ℓ-diversity, t for t-closeness and β for basic β-likeness scale in each case. Thus, as can be seen in Table 3, it is enough to apply the k-anonymity process (which requires little computational cost and, depending on the size of the dataset, is done in ARX in a few seconds) to obtain better values for the parameters of other anonymization techniques, such as increasing the value of ℓ for ℓ-diversity, or decreasing the value of t for t-closeness. The library can also be applied to a dataset that has not been previously processed, to check which techniques should be applied, which are the most significant quasi-identifiers (those that provide more information), etc. For example, given an unprocessed dataset, the values of k and ℓ obtained with the different permutations of the quasi-identifiers could be calculated, in order to evaluate the possibility of eliminating some of them (cataloging them as sensitive attributes for instance).
The same has been done again for the adult database, with the quasi-identifiers and sensitive attributes given in the first example of the Usage Examples subsection. Specifically, the dataset has been anonymized using ARX considering all k ∈ [2, 10] ∩ N for k-anonymity. Figure 1 shows the evolution of the values of t and β (the latter in logarithmic scale for better visualization) for t-closeness and basic β-likeness respectively, removing the values suppressed by ARX. A substantial decrease is produced both for the values of t and β (remember that t is strictly greater than the value obtained using pyCANON). Finally, let us see the results obtained with pyCANON for three different techniques of the nine described above after anonymizing using the ARX Software. The adult dataset will be used with the quasi-identifiers and sensitive attribute presented in the previous examples. With this example we want to show the usefulness of this library to check that an anonymization process has been carried out correctly. Tables 4 and 5 show the results obtained by setting certain values of t and δ for t-closeness and δ-disclosure respectively in ARX, and those obtained using pyCANON (without removing the records suppressed when anonymizing with ARX). It should be noted that in the case of basic β-likeness, differences may occur depending on the function used to calculate the distance; in the case of pyCANON, the relative distance is implemented. Table 6 shows five examples where the values established with ARX and the results obtained with pyCANON are compared. The same will occur with other methods depending on the strategy or definition considered in each case, as they may differ.
In addition, several tests have been carried out for the value of k of k-anonymity, using five different datasets. All of them can be found in the examples folder of the original framework repository. In these examples, a case with several sensitive attributes is included, and the values obtained with each of the two proposed approaches are studied.
All these functionalities will allow researchers and users in general to improve their practices regarding the publication of data using anonymization techniques, thus helping to raise awareness of the importance of the anonymization process. That is, it will enable data to be published with greater security guarantees. Note that if a database has been anonymized with a particular software, it is not very appropriate to check the anonymization level with that same software, because if there is a bug, it will not be detected by the check. pyCANON, in contrast, is an independent library that allows checking the correct application of these techniques, as well as finding the parameters for which techniques that were not deliberately applied are nevertheless verified. This can change the daily practices of those who are in charge of publishing data, but also of those institutions, centers or researchers who collaborate by sharing data among themselves, because they can be sure of the privacy level of the information they are going to share.
It should be noted that this library does not require strong knowledge of the Python language, since it is designed so that the user only has to enter a pandas dataframe with the data (or even the path to the file where the data is stored, in the case of using the CLI), a list of strings with the names of the columns that are quasi-identifiers, and another with those that are sensitive attributes. This makes it a library accessible to the general public wishing to check the anonymity level of a dataset, regardless of their background. The only condition is that the data must have one of the following extensions: .csv, .xls, .txt or .sav, or be stored in a pandas dataframe.

Methods
In this library, the user is provided with a set of functions to calculate the anonymity level of a dataset based on the techniques mentioned previously. In addition, for the case of multiple sensitive attributes, it reports the results for the two approaches previously exposed, namely generalization and quasi-identifiers update.

Specifically, a function for the calculation of each of the anonymity techniques presented has been implemented in the package anonymity. Let data be a pandas dataframe containing the dataset under study, qi the list of quasi-identifiers and sa that of the sensitive attributes. The parameter gen indicates whether to generalize in the case of multiple SAs: if True (default), the generalization approach is applied; if False, the set of QIs is updated for each SA. The different functions used to calculate the parameters for the previously mentioned properties are those listed in Table 7. Remember that the values of t and δ for t-closeness and δ-disclosure privacy must be strictly greater than the ones obtained using the functions of pyCANON.
In addition, the package report is a key functionality of this library. Again, data, qi, sa and gen are defined as in the previous example. The purpose of this package is to generate a report with the anonymization level of the data file entered, checking all the techniques mentioned above. This report can be generated as a JSON file or a PDF file, or displayed on the screen. An example will be shown in the Usage examples subsection below, and the basic schema of this package is represented in Figure 2. Besides being a library with the two packages mentioned above, anonymity and report, pyCANON is also implemented as a command line interface (CLI). The latter allows the user to execute the code via a simple command line, with interactive use via text input.
Finally, it is important to note that the overall structure of the library includes, among others, unit tests, documentation, and files with example data as well as their corresponding tests.Also, note that the documentation of the library can be found at https://readthedocs.org/projects/pycanon/.

Usage examples
Let us start by presenting an example for a very classical dataset: the adult dataset (see 10). Specifically, the following quasi-identifiers (QI) and sensitive attributes (SA) have been chosen: QI = ['age', 'education', 'occupation', 'relationship', 'sex', 'native-country'] and SA = ['salary-class']. Let FILE_NAME be a string with the path to the .csv file where the mentioned data are stored.

Example Code 2. Usage example of the report package

Future work
In the first release of this library, the nine techniques exposed when presenting the methods have been included, but this software may be extended in future versions with new functionalities. Among them, the idea of applying techniques such as δ-presence or k-map, for which it is necessary to use an auxiliary population, seems really attractive. In addition, it is also proposed for future updates of the library to include, together with the report that can be generated with the report package, personalized recommendations on the values obtained: for example, to inform that it is advisable to have a value of ℓ for ℓ-diversity strictly greater than 1, or that it is recommended that the value of α for (α,k)-anonymity be strictly lower than 1. Furthermore, these recommendations could be customized according to the amount of data, as well as the quasi-identifiers and sensitive attributes involved.

Figure 1 .
Figure 1. Evolution of t for t-closeness and log(β) for basic β-likeness when varying the value of k for k-anonymity.

Figure 2 .
Figure 2. Schema for obtaining the anonymization report from the data, list of quasi-identifiers and sensitive attributes.

Table 2 .
Anonymization techniques and the principal attacks they prevent (among others).

Table 3 .
Values of ℓ for ℓ-diversity, t for t-closeness and β for basic β-likeness obtained for a prefixed value of k for k-anonymity, considering the drug dataset (without removing the values suppressed when anonymizing with ARX).
* k = 15 has been set with ARX; it can be checked with pyCANON that the resulting data also verify k-anonymity for k = 17.
** k = 20 has been set with ARX; it can be checked with pyCANON that the resulting data also verify k-anonymity for k = 21.

Table 4 .
Values for t-closeness using ARX and pyCANON.