A unified method to revoke the private data of patients in intelligent healthcare with audit to forget

Revoking personal private data is one of the basic human rights. However, this right is often overlooked or infringed upon due to the increasing collection and use of patient data for model training. To secure patients' right to be forgotten, we propose a solution that uses auditing to guide the forgetting process, where auditing means determining whether a dataset has been used to train the model, and forgetting requires the information of a query dataset to be removed from the target model. We unify these two tasks by introducing an approach called knowledge purification. To implement our solution, we developed an audit to forget software (AFS), which is able to evaluate and revoke patients' private data from pre-trained deep learning models. Here, we show the usability of AFS and its application potential in real-world intelligent healthcare to enhance privacy protection and data revocation rights.


INTRODUCTION
Revoking personal private data is one of the basic human rights, which has been protected by privacy-preserving regulations such as the General Data Protection Regulation (GDPR) [1], the Health Insurance Portability and Accountability Act of 1996 (HIPAA) [2], and the California Consumer Privacy Act [3] since the late 20th century. Under those regulations, users are allowed to request the deletion of their own data for privacy reasons and to secure their own 'right to be forgotten'. However, with the development of data science, machine learning (ML) and deep learning (DL) techniques, this basic right is often neglected or violated. For example, it has been observed that patients' genetic markers were leaked from ML methods for genetic data processing [4], [5] without the patients being aware of it. When users realize that such risks exist, they may request that their own data be deleted to protect their privacy [6]. The aforementioned regulations then force the involved third parties to take action immediately. According to the requirements of those regulations, not only must the data previously authorized by individuals be deleted immediately from the hosts' storage systems, but the associated information should also be removed from DL models trained with those data, because DL models can memorize sensitive information from the training data and thus put individuals' privacy at risk [7], [8], [9], [10], [11].
1 Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
2 Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
* Corresponding author
# Equal contribution
Nowadays, healthcare is one of the most promising areas for the deployment of artificial intelligence (AI) systems, as so-called intelligent healthcare. ML- and DL-based computer-aided diagnosis (CAD) systems in intelligent healthcare accelerate the diagnosis of various diseases and in some tasks achieve even better results than doctors, for example in tumour detection [12], [13], retinal fundus imaging [14], and the detection and segmentation of COVID-19 lung infections [15], [16]. However, as more and more patients' data are collected and used for model training in intelligent healthcare, their privacy is exposed to high risk. Therefore, intelligent healthcare is a sector where technology must meet the law, regulations, and privacy principles to ensure that the innovation is for the common good [17]. To comply with those privacy-preserving regulations, methods to revoke personal private data from pre-trained DL models are necessary.
Deleting stored personal data is simple, whereas forgetting individuals' private information from pre-trained DL models can be difficult, because the contribution of individual data to the training process of a DL model cannot be fully measured due to the stochasticity of training [18]. Besides, due to the incremental nature of training, the model update caused by one sample affects the model's performance on subsequent samples, which makes the sample difficult to unlearn [18]. Finally, catastrophic unlearning might happen, in which case the unlearned model performs worse than a model retrained on the remaining dataset [19].
arXiv:2302.09813v1 [cs.LG] 20 Feb 2023
In general, the process of forgetting data from a pre-trained DL model can be divided into two steps. First, the unlearning process (forgetting) is performed on a given pre-trained DL model to forget the target data with different techniques, which generates a new DL model. Second, the new model is evaluated (auditing) against different metrics to prove that it has forgotten the target data. These two steps should be repeated until the new model passes the evaluation. In simple terms, there are two commonly acknowledged sub-tasks, which could also be stated in the reverse order: auditing and forgetting, as a two-player game. Auditing requires auditors to precisely evaluate whether the data of certain patients were used to train the target DL model. Once auditing confirms that the data of certain patients were used to train the target DL model, forgetting requires the removal of the learnt information of those patients' data from the target DL model, which is also called machine unlearning, while auditing can act as the verification of machine unlearning [18].
To achieve forgetting, existing unlearning methods can be classified into three major classes: model-agnostic methods, model-intrinsic methods and data-driven methods [20]. Model-agnostic methods refer to algorithms or frameworks that can be used for different DL models, including differential privacy [18], [21], [22], certified removal [23], [24], [25], statistical query learning [6], decremental learning [26], knowledge adaptation [27], [28] and parameter sampling [29]. Model-intrinsic approaches are methods designed for specific types of models, such as softmax classifiers [30], linear models [31], tree-based models [32] and Bayesian models [19]. Data-driven approaches focus on the data itself, including data partitioning [18], data augmentation [33], [34], [35] and other unlearning strategies based on data influence [36]. All methods have their specific application scenarios and limitations. Among the three classes, model-agnostic methods might have the strongest application prospects, as they can be applied to different models. Still, although more mechanisms and theoretical concepts are being proposed to explore different solutions to the forgetting task, few of them focus on applications in real-world intelligent healthcare.
When forgetting is accomplished, auditing is the next necessary step to verify it. Different metrics have been proposed to audit the membership of the query dataset, including accuracy, completeness [6], unlearn time, relearn time, retrain time, layer-wise distance, activation distance, JS-divergence, membership inference [37], [38], ZRF score [27], epistemic uncertainty [39] and model inversion attack [7]. In recent studies in intelligent healthcare, membership inference-based metrics were frequently used to determine whether any information about the samples to be forgotten was retained in the model [38]. The membership inference attack (MIA) operates in a black-box setting and calculates the probability that a single data point is a member of the training dataset D. Based on this individual-level MIA, Liu et al. [37] and Yangsibo et al. [38] focused on a more challenging task: auditing the membership of a set of data points. The ensembled membership auditing (EMA) [38] was proposed as the state-of-the-art method to verify whether a query dataset is memorized by a pre-trained DL model, and it is also a benchmark metric in machine unlearning. However, due to the black-box property of DL models, efficient and accurate auditing is still a challenging and under-studied topic. Moreover, researchers have tended to treat auditing and forgetting as separate tasks, ignoring the fact that the two can be linked to work as a self-consistent mechanism.
Here, we propose a novel solution that uses auditing to guide the forgetting process in a negative feedback manner. We unify the two tasks by introducing knowledge purification (KP), a new approach that selectively transfers the needed knowledge in order to forget the target information, instead of simply transferring all information as in knowledge distillation (KD) [40]. On the basis of KP, we have developed a user-friendly and open-source method called AFS, which can be easily used to revoke patients' private data from DL models in intelligent healthcare. To demonstrate the generality of AFS, we applied it to four tasks based on four datasets, including the MNIST dataset, the PathMNIST dataset, the COVIDx dataset and the ASD dataset, with different data sizes and various architectures of deep learning networks. Our results demonstrate the usability of AFS and its application potential in real-world intelligent healthcare.

The overall framework of AFS
AFS is a novel and unified method to revoke patients' private data by using auditing to guide the forgetting process in a negative feedback manner (Figure 1).
To audit the membership of the query dataset, AFS takes a pre-trained DL model and the query dataset as inputs, and determines whether the query dataset has been used for training the target DL model. This function was reimplemented based on EMA [38], a published MIA-based method to evaluate the membership of a query dataset. Our re-implementation allows quicker and easier usage of auditing (Section 2.4).
To forget the query dataset from a DL model, AFS takes as inputs the pre-trained DL model and the query dataset to be forgotten, where the query dataset has been used to train the DL model. To effectively forget the information of the query dataset from the pre-trained DL model, one idea is to transfer the information of the remaining dataset, excluding the query dataset, from the pre-trained model to a new model. Therefore, we designed a novel mechanism called knowledge purification (KP), which uses auditing to guide the forgetting process: by incorporating the auditing loss into the training process, it excludes the information of the query dataset while transferring the remaining information (Figure 2). With KP integrated, AFS can generate a new model in which the information of the target dataset should be forgotten under the guidance of auditing (Section 2.5).
To provide an applicable solution, we implemented AFS as open-source software that provides a user-friendly entry point, allowing users to use both functions with only one command. To demonstrate the generality of AFS, we applied it to four tasks based on four datasets, including the MNIST dataset, the PathMNIST dataset, the COVIDx dataset and the ASD dataset, which have different data sizes (Figure 3 and Section 2.2) and various architectures of deep learning networks (Section 2.3).

Dataset preparation
We used four public datasets that are commonly acknowledged in the machine learning and intelligent healthcare fields to demonstrate the versatility of AFS. For the benchmark experiment, we applied AFS to MNIST [41] and to PathMNIST [42] from the MedMNIST [43] dataset. The MNIST dataset contains 60,000 training images and 10,000 testing images of handwritten digits with size 28 × 28, labelled from 0 to 9. PathMNIST contains 100,000 non-overlapping image patches from hematoxylin & eosin stained histological images and 7,180 image patches from different clinical centres. In total, 9 types of tissues are involved in the PathMNIST dataset, including adipose, background, debris, lymphocytes, mucus, smooth muscle, normal colon mucosa, cancer-associated stroma, and COAD epithelium.
All images in PathMNIST were 224 × 224 (0.5 µm/px) and were normalized with the Macenko method [44]. For the application of AFS in intelligent healthcare, we used the COVIDx [45] dataset, which contains 13,975 chest X-ray (CXR) images across 13,870 patient cases, and the autism spectrum disorder (ASD) dataset for toddlers [46], which contains 20 features for 1,054 samples, to be used for determining influential autistic traits and improving the classification of ASD cases.
For each dataset, we further sampled partial data as the training dataset, the testing dataset, and the calibration dataset as below:
MNIST. We randomly sampled 10,000 images as the training dataset and 10,000 images as the testing dataset. We also randomly sampled 100, 1,000, 2,000, and 5,000 images that are disjoint from the training dataset as four calibration datasets, to illustrate the effect of calibration datasets of varied sizes on auditing and forgetting.
PathMNIST. We randomly sampled 10,000 images as the training dataset and 5,000 images as the testing dataset. We also randomly sampled 1,000 images that are disjoint from the training dataset as the calibration dataset.
COVIDx. We randomly sampled 5,000 images as the training dataset and 1,000 images as the testing dataset. We also randomly sampled 1,000 images that are disjoint from the training dataset as the calibration dataset.
ASD. We randomly sampled 500 samples as the training dataset and 100 samples as the testing dataset. We also randomly sampled 100 samples that are disjoint from the training dataset as the calibration dataset.
For all four datasets, we randomly sampled a fraction k ∈ {0.25, 0.5, 0.75} of the training dataset as the training dataset for knowledge distillation (KD) and AFS.
In addition, we prepared query datasets with different sizes N ∈ {1, 10, 100, 500, 1000, 2000}. A query dataset that completely overlaps with the training dataset is labelled QO, while a query dataset that is completely disjoint from the training dataset is labelled QNO. To further understand the effect of the purity of the query dataset, we also prepared a query dataset called QM, in which a fraction k of the query dataset overlaps with the training dataset. Finally, the query dataset designed to be forgotten is labelled QF.
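The sampling scheme described above can be sketched on index sets as follows. This is a minimal illustration and not the AFS implementation: the function names (`make_splits`, `subsample_training`, `make_query`) are hypothetical, and the sketch assumes that the training, testing and calibration datasets are all mutually disjoint and drawn from a single pool, whereas the text only states that the calibration dataset is disjoint from the training dataset.

```python
import random


def make_splits(n_total, n_train, n_test, n_cal, seed=0):
    """Sample disjoint training, testing and calibration index lists.

    Assumption: all three sets come from one pool of n_total samples
    and do not overlap with each other.
    """
    rng = random.Random(seed)
    idx = rng.sample(range(n_total), n_train + n_test + n_cal)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    cal = idx[n_train + n_test:]
    return train, test, cal


def subsample_training(train, k, seed=0):
    """Keep a fraction k of the training set (used for KD and AFS)."""
    rng = random.Random(seed)
    return rng.sample(train, int(k * len(train)))


def make_query(train, outside, size, k, seed=0):
    """Build a query set in which a fraction k overlaps the training set.

    k = 1.0 corresponds to QO, k = 0.0 to QNO, and values in between
    to QM. `outside` must contain only indices not used for training.
    """
    rng = random.Random(seed)
    n_overlap = round(k * size)
    members = rng.sample(train, n_overlap)
    non_members = rng.sample(outside, size - n_overlap)
    return members + non_members
```

For example, `make_query(train, outside, 100, 0.25)` returns 100 indices of which 25 come from the training set, which is one way to obtain a QM dataset with k = 0.25.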

Deep learning models and experiment setup
To present the generalizability of AFS across various DL models, we adopted different architectures for the four tasks, including the multilayer perceptron (MLP) [47], the convolutional neural network (CNN) [48] and ResNet [49]. There were a large DL model and a small DL model for each task, where the large model refers to the original pre-trained model and the small model is the new model generated by AFS.
For the MNIST dataset, we used MLP with 671,754 parameters as the teacher model and 155,658 parameters as the student model to achieve the 10-class classification task.
For the PathMNIST dataset, we adopted CNN with 21,285,698 parameters as the teacher model and 11,177,538 parameters as the student network for the 9-class classification task.
For the COVIDx dataset, we took ResNet34 with 21,285,698 parameters as the teacher model and ResNet18 with 11,177,538 parameters as the student network to achieve the binary classification of healthy people and patients.
For the ASD dataset, we used the MLP with 3,586 parameters as the teacher model and the MLP with 898 parameters as the student model for the binary classification of autism in toddlers.
During model training, the number of epochs was fixed to 50, the learning rate was set to 1e-5 and the Adam optimizer was used. A workstation with 252 GB RAM, 112 CPU cores and 2 Nvidia V100 GPUs was used for all experiments. The AFS method was developed based on Python 3.7, PyTorch 1.9.1 and CUDA 11.4. A detailed list of dependencies can be found in the code repository (see Code Availability).

Audit the membership of query dataset
EMA [38] is designed as a 2-step process. In the first step, the best threshold for each metric is selected to maximize (TPR(t) + TNR(t))/2 based on the calibration dataset, as shown in Algorithm 1. Once the thresholds for all metrics are selected, a sample in the query dataset is confirmed as a member if at least one metric is larger than the corresponding threshold. In total, three metrics, including correctness [50], confidence [51], [52], and negative entropy [53], [54], were adopted in AFS as proposed in previous work [38], [55].
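The first step can be illustrated with a small sketch of threshold selection and per-sample membership. It is not the EMA or AFS code. The function names are hypothetical, the definitions of the three metrics follow common usage in the membership inference literature and are assumptions here, and which datasets supply the reference member and non-member scores is left to the caller.

```python
import math


def membership_metrics(probs, label):
    """Return (correctness, confidence, negative entropy) for one sample.

    probs is the model's softmax output and label the true class index.
    These definitions are assumptions based on common MIA usage.
    """
    predicted = max(range(len(probs)), key=probs.__getitem__)
    correctness = 1.0 if predicted == label else 0.0
    confidence = probs[label]
    neg_entropy = sum(p * math.log(p) for p in probs if p > 0)
    return correctness, confidence, neg_entropy


def balanced_accuracy(member_scores, nonmember_scores, t):
    """(TPR(t) + TNR(t)) / 2, where a score above t predicts 'member'."""
    tpr = sum(s > t for s in member_scores) / len(member_scores)
    tnr = sum(s <= t for s in nonmember_scores) / len(nonmember_scores)
    return (tpr + tnr) / 2


def infer_threshold(member_scores, nonmember_scores):
    """Pick the observed score that maximizes the balanced accuracy."""
    candidates = sorted(set(member_scores) | set(nonmember_scores))
    return max(
        candidates,
        key=lambda t: balanced_accuracy(member_scores, nonmember_scores, t),
    )


def sample_membership(metric_values, thresholds):
    """A sample is a member if at least one metric exceeds its threshold."""
    return int(any(v > t for v, t in zip(metric_values, thresholds)))
```

In this sketch, `infer_threshold` would be called once per metric, and `sample_membership` then combines the three resulting thresholds into one 0/1 decision per query sample.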
Once the membership of all samples in the query dataset is determined in the previous step, the query dataset is further evaluated to determine whether it has been used to train the target pre-trained DL model. A two-sample statistical test is adopted to evaluate the query dataset based on the sample-wise membership and an all-one vector. The p-value of the two-sample statistical test is used as the output of auditing. Given a user-defined threshold α, if p < α, then users can conclude that the query dataset was not used for training the target DL model. EMA was re-implemented and integrated into AFS to allow easy and fast auditing.
Algorithm 1 (steps): The calibration model is trained as ...
for m_i ∈ {m_1, ..., m_n} do
    Compute metrics for training dataset as ...
    Compute metrics for test dataset as ...
    Find the threshold t_i that maximizes (TPR(t) + TNR(t))/2, where ...
return The thresholds t_1, ..., t_n for n metrics
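The second step, comparing the sample-wise membership vector with an all-one vector, can be sketched as below. The text does not name the two-sample test, so this sketch swaps in a simple permutation test on the difference of means as a stand-in. The function names, the number of permutations and the default α = 0.05 are assumptions for illustration only.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(sample_a, sample_b, n_perm=999, seed=0):
    """Two-sample permutation test on the absolute difference of means.

    Stand-in for the unspecified two-sample test in the text.
    """
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)


def audit_query_set(membership, alpha=0.05, n_perm=999, seed=0):
    """Compare the 0/1 membership vector against an all-one vector.

    Returns (p, used); used is False when p < alpha, i.e. the query
    dataset is judged not to have been used for training.
    """
    ones = [1] * len(membership)
    p = permutation_p_value(membership, ones, n_perm=n_perm, seed=seed)
    return p, p >= alpha
```

With this stand-in, a query set whose samples are all flagged as members yields p = 1, while a set with mostly non-members yields a small p-value. A query set of size 1 cannot be separated from the all-one vector in this sketch, which is in line with the observation that set-level auditing becomes more decisive as the query dataset grows.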

Audit-guided forgetting of query dataset with AFS
Forgetting aims to remove the remembered information of the query dataset from the target DL model. Similar to knowledge distillation (KD), a teacher-student paradigm was adopted in AFS, but with an additional requirement to selectively forget the information associated with the data we want to forget. Thus, we designed a novel approach called knowledge purification (KP), which means purifying the knowledge in the teacher model (the original pre-trained model), discarding the information related to the data that needs to be forgotten, and transferring the purified information into the student model (the new model). AFS unifies auditing and forgetting into a circular process to effectively enhance unlearning in a negative feedback manner.
As shown in Figure 1, during each epoch of training, the training data are fed into both the teacher model and the student model, while the data to be forgotten are audited on the student model. Our main goal is to transfer the knowledge from the teacher model to the student model while forcing the student model to reject the information associated with the data to be forgotten. To achieve that, we added the audit loss to the total loss, thus allowing the student model to accept partial knowledge from the teacher model and achieve KP, as shown in Algorithm 2.
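The composition of the total loss in Algorithm 2, loss_AFS = loss_classification + loss_KD + loss_audit, can be sketched in plain Python as follows. Only the unweighted sum of three terms is taken from the paper. The concrete form of each term is an assumption: cross-entropy for classification, a temperature-softened KL divergence for KD, and, as a purely hypothetical stand-in for the audit loss, the student's mean confidence on the true labels of the samples to be forgotten.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Classification loss for one sample."""
    return -math.log(softmax(student_logits)[label])


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def audit_loss(student_logits_forget, labels_forget):
    """Hypothetical audit term: mean confidence on the forget samples.

    A high value means the student still looks like it has seen the
    data, so minimizing it pushes the membership signal down.
    """
    confs = [softmax(z)[y] for z, y in zip(student_logits_forget, labels_forget)]
    return sum(confs) / len(confs)


def afs_loss(teacher_logits, student_logits, labels,
             student_logits_forget, labels_forget):
    """Unweighted sum of the three terms, averaged over the batch."""
    n = len(labels)
    cls = sum(cross_entropy(s, y) for s, y in zip(student_logits, labels)) / n
    kd = sum(kd_loss(t, s) for t, s in zip(teacher_logits, student_logits)) / n
    audit = audit_loss(student_logits_forget, labels_forget)
    return cls + kd + audit
```

In an actual training loop the same three terms would be computed on tensors with automatic differentiation, and the student would be updated on the summed loss once per batch.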

Evaluation metrics
Since all four tasks are either multi-class classification tasks or binary classification tasks, we adopted the accuracy and the F1-score as the evaluation metrics, defined as below, where TP represents true positives, TN stands for true negatives, FN represents false negatives and FP stands for false positives.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1-score = 2TP / (2TP + FP + FN)
To evaluate the membership of the query dataset, the p-value of the two-sample statistical test was used as mentioned previously.
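As a small worked example, accuracy and F1-score can be computed from the confusion counts with the standard definitions, Accuracy = (TP + TN) / (TP + TN + FP + FN) and F1 = 2TP / (2TP + FP + FN). The sketch below covers the binary case only. The averaging scheme used for the multi-class tasks is not stated in the text, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    # Convention: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0
```

For `y_true = [1, 1, 0, 0, 1]` and `y_pred = [1, 0, 0, 1, 1]` the counts are TP = 2, TN = 1, FP = 1 and FN = 1, giving an accuracy of 0.6 and an F1-score of 2/3.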

AFS audits private datasets stably and robustly
To evaluate the robustness of auditing by AFS, we used it to audit query datasets with different sizes, various purities (a fraction k of the query dataset overlapped with the training dataset) and different sizes of the calibration dataset (ranging from 100 to 5,000) (Method and Figure 4A). For each sample in the query dataset, AFS calculates three metrics for membership inference: correctness, confidence and negative entropy (Method). As shown in Figure 4B, all three metrics showed different distributions for QO (query dataset overlapped with the training dataset) and QNO (query dataset disjoint from the training dataset), indicating a dataset-wise divergence of the metrics between samples in the training dataset and samples disjoint from it. Finally, by integrating these three metrics, AFS reports a p-value to evaluate whether a query dataset has been used to train the target DL model. A larger p-value indicates a higher probability that the query dataset was used in training.
When the sizes of the query dataset and the calibration dataset varied, AFS could still efficiently distinguish QO and QNO (Figure 4C and D). Compared to QO, AFS reported a much smaller p-value for QNO, indicating a weak membership (a small probability that the query dataset has been used to train the target DL model), thus allowing users to judge whether the query dataset was used to train the target DL model. Meanwhile, when the size of the query dataset increased from 1 to 2,000, AFS discriminated QO and QNO more confidently, as the divergence of the p-values became more significant, and this was not affected by the size of the calibration dataset. To further understand the effect of the purity of the query dataset on auditing, we mixed some samples from the training dataset into QNO, and the new query dataset was labelled QM (partial data overlapped with the training dataset). The fraction of data in QM that overlapped with the training dataset was denoted by k = (number of samples overlapping with the training dataset) / (size of QM). As shown in Figure 4E, AFS showed a decreasing p-value as k decreased, meaning that the query dataset was less likely to have been used to train the target DL model. In conclusion, these results indicate the robustness of AFS in determining whether the query data have been used to train the target DL model.

AFS forgets the information of query dataset, maintains perfect usability and generates smaller model
Once auditing has confirmed that a dataset was used to train the target DL model, AFS can be used for forgetting, that is, to remove the information of the dataset from the pre-trained DL model. To comprehensively show the ability of AFS in removing information [...] Taking the MNIST dataset as an example, for the models trained with each method, in addition to auditing on QO and QNO, we further audited the membership of two datasets designed to be forgotten (a small query dataset QF100 and a large query dataset QF1000) to assess the ability of the different methods to forget the query dataset. As shown in Table 1, regardless of which method was used to train the model, AFS could effectively distinguish between QO and QNO, and the divergence in auditing the two query datasets was enlarged as the size of the query dataset increased.
As shown in Table 2, AFS perfectly predicted the membership of QF100 and QF1000 on the models from both the Independent teacher and the Independent student methods, as both query datasets were included in the training dataset. Since both query datasets were disjoint from the partial training dataset when k ∈ {0.25, 0.5, 0.75}, auditing on the model trained with Independent student with k ∈ {0.25, 0.5, 0.75} weakly denied the membership of QF100 (P_QF100,k=0.75 = 4.36E-2, P_QF100,k=0.5 = 6.91E-3, P_QF100,k=0.25 = 6.91E-3) and QF1000 (P_QF1000,k=0.75 = 5.26E-12, P_QF1000,k=0.5 = 2.34E-15, P_QF1000,k=0.25 = 2.90E-19). However, since only the partial training dataset was used when k ∈ {0.25, 0.5, 0.75}, the models retrained with Independent student only learnt the information of the partial training dataset and lost the information from the remaining data in the complete training dataset, resulting in a significant drop in model performance compared to either the Independent student or the Independent teacher trained with the complete training dataset.
To rescue the information lost due to the use of partial training samples and to further increase the model performance, AFS can use only a partial training dataset (k ∈ {0.25, 0.5, 0.75}) to transfer the knowledge from the Independent teacher pre-trained with the complete training dataset. As shown in Table 2, the model trained with AFS provided a higher accuracy and F1-score than the Independent student trained with the partial training dataset (k ∈ {0.25, 0.5, 0.75}), together with a better forgetting performance (a much smaller auditing score on QF100 and QF1000), as AFS used auditing as feedback for forgetting and could forget not only the query samples but also other samples with similar features.
We also applied AFS to the 9-class classification of hematoxylin & eosin-stained histological images from the PathMNIST dataset with CNN. As shown in Table 3, AFS could still distinguish QO and QNO from the PathMNIST dataset. The divergence of auditing between QO and QNO was more significant than that on the MNIST dataset. With the requirement to forget both query datasets (QF100 and QF1000), the model trained with AFS performed better at forgetting information (P_QF100,k=0.75 = 2.25E-5, P_QF100,k=0.5 = 2.87E-6, P_QF100,k=0.25 = 3.32E-7, P_QF1000,k=0.75 = 2.05E-41, P_QF1000,k=0.5 = 4.75E-35, P_QF1000,k=0.25 = 1.84E-56) while learning more information from the Independent teacher model trained with the complete training dataset.
In summary, AFS could effectively forget the information of the query dataset from the target DL model. Since KP was integrated into AFS, it could generate a smaller DL model that masters knowledge from the larger teacher model by using only a partial training dataset (k = 0.5 could achieve a good balance between forgetting and model performance), without the need to retrain the larger model with the complete training dataset. Compared to retraining the student model, the model trained with AFS showed even better performance in forgetting the information while maintaining better model performance (accuracy and F1-score), as it learnt the knowledge from the model trained with the complete training dataset. As shown by the ablation study in Tables 2 and 4, compared to AFS w/o Audit, the audit-guided AFS could forget the information more significantly, but with an acceptable cost in model performance (accuracy and F1-score).

Apply AFS to forget medical images
To show the versatility of AFS, we applied it to the classification of pneumonia versus normal chest X-ray images from the COVIDx dataset with ResNet, which is a classic task in medical image analysis. As shown in Figure 5A, on both query datasets (QF100 and QF1000), AFS could effectively forget the information of the query dataset, while generating a new model with far fewer parameters, as shown in Figure 5B. Surprisingly, the model generated by AFS showed even better accuracy than the Independent teacher trained with the complete dataset and the Independent student trained with the partial training dataset. This result not only indicated that AFS could effectively transfer the knowledge from the teacher model to the student model, but also suggested that, in some real-world cases, a student model with a simpler architecture could perform even better than the teacher model with KP in AFS, due to the reduction of model parameters and the purification of knowledge.

Apply AFS to forget electronic health records
To further demonstrate the generalizability of AFS in both auditing and forgetting, we applied AFS to the prediction of early autism spectrum disorder (ASD) traits of toddlers, a task whose data contain sensitive information about patients, such as age, gender and family gene traits. That information was stored as electronic health records (EHR). As shown in Figure 5A, similar to the previous results on other datasets, AFS effectively removed the information of both query datasets from the pre-trained DL model. Since the size of the ASD dataset was quite small, we adopted two smaller query datasets (QF50 and QF100) to be forgotten. Compared to the models trained with other methods, the model trained with AFS successfully forgot the information of both QF50 (P_QF50,k=0.75 = 0.08, P_QF50,k=0.5 = 0.08, P_QF50,k=0.25 = 0.156) and QF100 (P_QF100,k=0.75 = 0.004, P_QF100,k=0.5 = 0.007, P_QF100,k=0.25 = 0.007) without affecting the model utility significantly (Acc_AFS,k=0.75 = 0.98, Acc_AFS,k=0.5 = 0.98, Acc_AFS,k=0.25 = 0.98).

DISCUSSION
To our knowledge, AFS is the first unified method of auditing and forgetting that could effectively forget the information of the target query dataset from the pre-trained DL model with the guidance of auditing. We designed AFS as a model-agnostic and open-source method that is applicable to different models. As shown in Figure 5C, AFS could generate a smaller model, which requires much less time and GPU memory during inference (Tables 5 and 6), by training with a partial training dataset (∼50%) with our novel KP approach. Moreover, AFS could forget the information of the query dataset at the expense of an acceptable reduction in model performance.
Our experiments on four datasets showed that AFS generalized to datasets of different sizes and forms, including medical images and EHR. Since deep learning models with different architectures were applied to the four tasks, we further demonstrated the broad applicability of AFS to common deep learning models. In addition, our tasks include both binary classification and multi-class classification tasks, which also suggests that AFS is applicable to tasks with multiple labels.
With current laws that guarantee people the right to revoke their own data, AFS could help institutions and companies to efficiently iterate their models to forget individual information at the model level. However, there are still some shortcomings in applying the current version of AFS in a production environment, which could be the main directions of future research. Firstly, the models and data we tested in this study were still not large compared to the data in a real production environment. Therefore, it is unknown whether scaling AFS to larger models and more data will cause new problems. Secondly, there are different approaches to auditing, and thus we could add more auditing metrics to AFS to guide the forgetting process in a future version. Finally, due to the limitation of auditing, it is still difficult to perform individual-level forgetting, as we need to compare the difference in statistical distribution based on a fraction of data points, which could be the major possible improvement for a future version of AFS. Despite these limitations, we believe that AFS will make a valuable contribution towards better protection of people's privacy and of the right to revoke data, given the rapid development of intelligent healthcare.

CODE AVAILABILITY
The AFS software is publicly available at https://github.com/JoshuaChou2018/AFS

Fig. 1. AFS is a unified method to revoke patients' private data in intelligent healthcare. Given a pre-trained DL model and a query dataset, AFS could audit and provide confidence whether the query dataset has been used to train the target DL model. When a dataset has been used to train the target DL model, AFS could effectively remove the information about the dataset from the target DL model with the guidance of auditing. To achieve that, we proposed a novel method called knowledge purification, which utilizes results from auditing as feedback to forget information.

Fig. 2. Illustration of knowledge distillation and knowledge purification. Knowledge purification requires the selective transfer of the needed knowledge in the process of knowledge distillation to forget the target information instead of simply transferring all information.

Fig. 3. Illustration of the four datasets and DL models used to show the versatility of AFS.

Algorithm 1. Infer thresholds
Require: The calibration dataset $D_{cal}$, the pre-trained DL model $A$, and $n$ different metrics $(m_1, \ldots, m_n)$ for membership testing.
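As a rough illustration of threshold inference, the sketch below derives one threshold per metric from values measured on the calibration dataset. It assumes that the calibration samples are non-members and that each threshold is a fixed quantile of the calibration values; the quantile rule and the names `infer_thresholds` and `fraction_above` are hypothetical and are not taken from the AFS implementation.

```python
def infer_thresholds(calibration_metrics, quantile=0.95):
    """Return one threshold per metric.

    calibration_metrics maps a metric name to the list of values that the
    metric takes on the calibration samples (assumed to be non-members).
    The threshold is the value at the given quantile of that list.
    """
    thresholds = {}
    for name, values in calibration_metrics.items():
        ordered = sorted(values)
        index = min(len(ordered) - 1, int(quantile * len(ordered)))
        thresholds[name] = ordered[index]
    return thresholds


def fraction_above(query_values, threshold):
    """Fraction of query samples whose metric exceeds the threshold."""
    return sum(1 for v in query_values if v > threshold) / len(query_values)


# Hypothetical metric values on 100 calibration samples.
calibration_metrics = {"confidence": [i / 100 for i in range(100)]}
thresholds = infer_thresholds(calibration_metrics, quantile=0.95)

# Hypothetical metric values on a small query dataset.
query_confidences = [0.99, 0.97, 0.50, 0.98]
flagged = fraction_above(query_confidences, thresholds["confidence"])
print(thresholds, flagged)
```

A query dataset in which far more than 5% of the samples exceed the threshold would, under these assumptions, look more like training data than like the calibration data.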

Require: The calibration dataset $D_{cal}$, the query dataset to forget $D_{forget}$, the sampled training dataset $D_{train}$ for KP, the pre-trained DL model $F$, the new model $f$, and the number of epochs $T$.
procedure
    for epoch $\in \{1, \ldots, T\}$ do
        Forward $D_{train}$ with $F$
        Forward $D_{train}$ with $f$
        Infer threshold with $D_{cal}$ on $f$ and audit $D_{forget}$ on $f$ to get $loss_{audit}$
        Calculate $loss_{AFS} = loss_{classification} + loss_{KD} + loss_{audit}$
        Update $f$ based on $loss_{AFS}$
    return The new student model $f$ with information about $D_{forget}$ forgotten.
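The combined objective can be sketched for a single training sample as follows. The classification term is a standard cross-entropy and the distillation term is a temperature-softened KL divergence between teacher and student outputs. The audit term shown here, a hinge penalty on forget-set confidences that exceed the membership threshold, is only an assumed stand-in, since the exact form of the audit loss is not given in this section. All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(student_logits, label):
    """Cross-entropy of the student prediction against the true label."""
    return -math.log(softmax(student_logits)[label])


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs,
    scaled by T^2 as is common in knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def audit_loss(forget_confidences, threshold):
    """Assumed form: mean hinge penalty on forget-set confidences that
    still exceed the membership threshold inferred on the calibration set."""
    penalties = [max(0.0, c - threshold) for c in forget_confidences]
    return sum(penalties) / len(penalties)


def afs_loss(teacher_logits, student_logits, label,
             forget_confidences, threshold):
    """loss_AFS = loss_classification + loss_KD + loss_audit."""
    return (classification_loss(student_logits, label)
            + kd_loss(teacher_logits, student_logits)
            + audit_loss(forget_confidences, threshold))


teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.2, 0.3]
total = afs_loss(teacher_logits, student_logits, label=0,
                 forget_confidences=[0.9, 0.5], threshold=0.7)
print(f"loss_AFS = {total:.4f}")
```

In an actual training loop the three terms would be computed on mini-batches with an automatic differentiation framework, so that the student $f$ can be updated from the gradient of the summed loss.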

Fig. 4. Performance of auditing using AFS on the four datasets. A. Illustration of the training dataset, the test dataset, the calibration dataset, the query dataset that overlaps with the training dataset (QO), and the query dataset that is disjoint from the training dataset (QNO). B. Distribution of three metrics for samples in QO and QNO. C. Auditing performance when varying the size of the calibration dataset and the size of the query dataset. D. The p-value of auditing on QO and QNO for the four datasets. E. The p-value of auditing when varying k of the query dataset for the four datasets.

Fig. 5. Performance of forgetting using AFS on four datasets. A. The p-value of auditing on a small query dataset and a large query dataset (QF), and the accuracy of models trained with different methods, including Original (an independent teacher trained with the complete training dataset), Data Deletion (an independent student model trained with a partial training dataset and k = 0.5), AFS (w/o Audit), and AFS. B. The number of parameters in the original large model and in the new small model generated by AFS. C. Qualitative evaluation of three methods, Original, Data Deletion, and AFS (as defined in A), on five dimensions: ability to forget, accuracy, size of the dataset needed for training, size of the generated model, and training efficiency. A larger value means a stronger ability to forget, higher model accuracy, a smaller dataset needed for training, a smaller generated model, and more efficient training.

TABLE 1
Comparison of AFS with other methods on auditing QO and QNO from the MNIST dataset with a varied number of samples in the query dataset. The table shows the results of auditing QO and QNO on models trained with different methods. A larger value indicates stronger membership.

TABLE 2
Comparison of AFS with other methods on forgetting QF and model performance with the MNIST dataset. $QF_{100}$ is the small query dataset containing 100 samples and $QF_{1000}$ is the large query dataset containing 1000 samples. We present the p-values of auditing models trained with different methods on $QF_{100}$ and $QF_{1000}$, together with the model performance, including the accuracy and F1-score.
$QF_{100}$ and $QF_{1000}$ were included in the training dataset, while these two query datasets were excluded from the training dataset when k ∈ {0.25, 0.5, 0.75}.

TABLE 3
Comparison of AFS with other methods on auditing QO and QNO from the PathMNIST dataset with a varied number of samples in the query dataset. The table shows the results of auditing QO and QNO on models trained with different methods. A larger value indicates stronger membership.

TABLE 4
Comparison of AFS with other methods on forgetting QF and model performance with the PathMNIST dataset. $QF_{100}$ is the small query dataset containing 100 samples and $QF_{1000}$ is the large query dataset containing 1000 samples. We present the p-values of auditing models trained with different methods on $QF_{100}$ and $QF_{1000}$, together with the model performance, including the accuracy and F1-score.

TABLE 5
Time for inferring 100 samples with the original model and the model generated by AFS.