Background & Summary

Open access or shared research data must comply with the Health Insurance Portability and Accountability Act (HIPAA) regulations that govern patient privacy. These regulations require the de-identification or removal of protected health information (PHI) and other personally identifiable information (PII) from datasets before they can be made publicly available. The Cancer Imaging Archive (TCIA)1 of the National Cancer Institute (NCI) is one of the largest and most trusted public archives of de-identified cancer images. Over the years, TCIA has developed image de-identification tools and protocols that combine automated and manual de-identification processes. This approach has proven effective for the de-identification of DICOM radiology imaging and digital pathology whole-slide imaging (WSI) submitted to TCIA.

The process of image de-identification and curation is time-consuming, requires significant resources, and is prone to human fatigue and error. Automated image de-identification algorithms require evaluation before they can be deployed to process data for open access. This evaluation requires a robust dataset against which image de-identification algorithms can be assessed. We set out to develop a de-identification evaluation dataset to address that need. Because TCIA is one of the most mature imaging archives with an established and effective image de-identification method, we adopted the TCIA curation process as the current best practice in de-identification. Using TCIA and a newly developed toolset, we created an evaluation dataset by inserting synthetic PHI into already de-identified data.

While it is common to assume de-identification and anonymization are synonymous, in this document we follow Kushida et al.2,3 who make a clear distinction between these concepts: “De-identification of medical record data refers to the removal or replacement of personal identifiers so that it would be difficult to re-establish a link between the individual and his or her data. Anonymization refers to the irreversible removal of the link between the individual and his or her medical record data to the degree that it would be virtually impossible to reestablish the link.” Throughout this document, we will only deal with de-identification.

The evaluation dataset described in this data descriptor is a subset of a larger evaluation dataset created under contract for the National Cancer Institute. We have published this subset on TCIA and describe it here to allow researchers to test their de-identification algorithms and to promote standardized procedures for validating automated de-identification.

Methods

The full process of generating the evaluation dataset and de-identified evaluation dataset, which serves as an example result of applying a complete de-identification process to the evaluation dataset, is summarized in Fig. 1. Note that in this document, the terms “subject” and “patient” are used as synonyms.

Fig. 1

Schematic description of the processing steps involved in the creation of the evaluation dataset and de-identified evaluation dataset.

Images selected from TCIA

To build the evaluation dataset, we selected imaging studies from TCIA to represent a broad cross-section of the current TCIA public collections. Table 1 breaks down the content of the evaluation set into the total number of patients, studies, series and images per modality, the anatomy imaged by modality, and the manufacturers of imaging equipment used to collect the data. No images of heads were included to avoid subjects being identified by facial features4,5. The evaluation set comprises 1,693 images from 21 patients, 22 studies, and 26 series, for a total of 609 MB of data.

Table 1 Evaluation Dataset Characterization.

Implants

A handful of images containing medical implants were visually inspected for PHI by a trained member of TCIA’s curation team. It is important to visually inspect implant devices because they could contain a serial number that could be used to identify the patient6. If PHI is found, it should be removed or obscured in the image, and if not possible, then the image should not be published. In our selected images, we did not see any information that would warrant alteration or removal of images. Users of this dataset could be instructed to obscure the model numbers as a test of this capability, but normally they would not be required to make such modifications as model numbers do not constitute PHI since model numbers in general are not traceable back to an individual.

DICOM Standard and Manufacturer’s Private Attributes Using Audit Logs

TCIA audit logs are updated whenever curators make any adjustments to DICOM information objects (including image headers) to remove potential PHI. These audit logs represent the complete provenance of the changes made to transform the submitted data into the published information objects7. The logs contain the before/after/replaced values of all DICOM standard Attributes and manufacturer’s Private Attributes8.

When DICOM data are submitted to TCIA, Private Attributes are de-identified according to the DICOM Retain Safe Private Option9 that allows for the retention of data stored in Private Attributes that do not hold PHI. Retention decisions are based on the extensive Private Attribute dictionary maintained by TCIA, which contains all the Private Attributes ever submitted to TCIA8. The dictionary also contains the process operation description (POD) used to modify the data in the Private Attribute to accomplish de-identification. The PODs are: (1) kept, (2) hashed, (3) off-set, (4) deleted, or (5) emptied. The choice of which POD to employ in a given instance is based on the Attribute Type and definition, e.g., DICOM unique identifiers (UIDs) are hashed, dates are off-set.
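The five process operations can be illustrated with a minimal sketch. The function name `apply_pod`, the salt, the 30-day shift and the use of a `2.25.` root for hashed UIDs are assumptions made for illustration only; they are not taken from the TCIA Private Attribute dictionary or from Posda.

```python
import hashlib
from datetime import datetime, timedelta


def apply_pod(value, pod, *, salt="example-salt", day_shift=-30):
    """Apply one process operation to a single attribute value.

    Returns the new value, or None when the attribute is deleted.
    The salt and day_shift defaults are illustrative assumptions.
    """
    if pod == "kept":
        return value
    if pod == "hashed":
        # Build a UID-like string from a salted SHA-256 digest (assumed scheme).
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        return "2.25." + str(int(digest[:32], 16))
    if pod == "offset":
        # DICOM DA values are encoded as YYYYMMDD strings.
        shifted = datetime.strptime(value, "%Y%m%d") + timedelta(days=day_shift)
        return shifted.strftime("%Y%m%d")
    if pod == "deleted":
        return None
    if pod == "emptied":
        return ""
    raise ValueError(f"unknown process operation: {pod}")
```

In this sketch, hashing is deterministic, so the same input UID always maps to the same output and references between instances stay consistent, while offsetting preserves the intervals between dates.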

We stratified the coded data from the audit log by a combination of variables, including whether or not the DICOM Attribute is standard or private, DICOM Attribute description, and the TCIA process operation. A Pareto analysis10 was performed to determine the vital few data element/operation combinations that occur with the greatest frequency. Subsets of the results of this analysis can be found in Tables 2 and 3.
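A Pareto analysis of this kind can be sketched as a ranked frequency count with a cumulative share. The record layout (a tuple of standard/private flag, Attribute description and process operation) and the 80% cutoff are assumptions for illustration; the audit-log schema and the threshold actually used are not given here.

```python
from collections import Counter


def pareto(records, cutoff=0.8):
    """Return the most frequent combinations that together reach `cutoff`.

    Each result item is (combination, count, cumulative_share).
    """
    counts = Counter(records)
    total = sum(counts.values())
    vital, cumulative = [], 0
    for combination, n in counts.most_common():
        cumulative += n
        vital.append((combination, n, cumulative / total))
        if cumulative / total >= cutoff:
            break
    return vital


# Hypothetical audit-log rows: (standard or private, description, operation).
example_rows = (
    [("standard", "Study Date", "offset")] * 6
    + [("standard", "Patient Name", "emptied")] * 3
    + [("private", "Vendor Comment", "deleted")] * 1
)
vital_few = pareto(example_rows)
```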

Table 2 lists examples of standard DICOM Attributes. Table 3 shows examples of Private Attributes; both tables list the Data Element tags (group and element number combination from the DICOM data dictionary) and the frequency counts of each. It should be noted that the data fields listed do not always signify that PHI was seen during the de-identification process, only that the potential for PHI existed and actions were taken to ensure that no PHI made it through the curation process.

Table 2 Standard DICOM Attributes containing PHI.
Table 3 Private DICOM Attributes containing PHI.

Generation of synthetic data

Synthetic PHI data elements were generated using the Python package Faker (https://pypi.org/project/Faker, version 4.1.2). In addition to data elements one might expect to contain PHI, e.g., Patient Name and Address, we identified common Attributes, such as Study Description, which could potentially contain useful information while also containing PHI. These Attributes were selected for potential synthetic PHI insertion to demonstrate that deleting or emptying Attributes indiscriminately is not always the best solution; rather, the information in the Attribute needs to be modified to retain important information while removing PHI.
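The difference between emptying an Attribute and modifying it can be illustrated with a short sketch. To keep the example self-contained, it draws names from a small fixed list rather than from Faker; the list, the function names and the "for <name>" phrasing are hypothetical and do not reproduce the insertion patterns used for the dataset.

```python
import random
import re

# Hypothetical stand-ins for values that a generator such as Faker would supply.
SYNTHETIC_NAMES = ["Jordan Avery", "Casey Morgan", "Riley Quinn"]


def insert_synthetic_phi(description, rng):
    """Append a synthetic patient name to a clinically useful description."""
    name = rng.choice(SYNTHETIC_NAMES)
    return f"{description} for {name}", name


def scrub(description, known_phi):
    """Remove only the PHI strings and keep the rest of the description."""
    for phi in known_phi:
        pattern = r"\s*(for\s+)?" + re.escape(phi)
        description = re.sub(pattern, "", description)
    return description.strip()


rng = random.Random(0)
tainted, inserted_name = insert_synthetic_phi("CT CHEST W CONTRAST", rng)
cleaned = scrub(tainted, [inserted_name])
```

Emptying the Attribute would discard "CT CHEST W CONTRAST" along with the name, whereas the targeted edit keeps the clinically relevant text.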

Selecting research critical fields and adherence to DICOM standard

In the DICOM standard, each Attribute is assigned a Type that specifies whether the Attribute is mandatory, optional, or conditional. The Attribute Type may be dependent on the modality of the image. The five Attribute Types are shown in Table 4.

We focused only on Attributes that were Type 1 (Attribute required, valid value required) and Type 2 (Attribute required, value may be null). Type 1C and 2C Attributes are conditional and require a determination of whether the conditions that make the Data Element Type 1 or Type 2 have been met. Therefore, no Type 1C or 2C Attributes were modified with synthetic PHI, although we retained Type 1C and 2C Attributes in the image headers under the assumption that they were properly de-identified during initial TCIA curation. Note also that Attribute Types vary depending on the Service Object Pair (SOP) Class (modality), so we took this into account when generating our list of required Attributes.

Table 4 Attribute Types.

Table 5 shows a subset of the full list of Research Critical Fields we generated, showing the requirements for various DICOM Attributes for different modalities and the types and descriptions of each. The modality column signifies how the Attributes are treated based on modality. For fields where this entry is “All”, the type applies to all modalities. The tag column provides the DICOM group and element tag for the data element that encodes the Attribute, the Attribute column contains the name of the Attribute, the desc column provides a description and conditional requirements, and the Type column identifies the Attribute Type (1, 1C, 2, or 2C) as shown in Table 4.
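A list structured in this way lends itself to a simple lookup of the Attributes eligible for synthetic PHI insertion for a given modality. The sketch below is illustrative: the class and function names are hypothetical, and the handful of rows reflects our reading of common DICOM modules rather than the contents of Table 5, so each Type should be checked against the standard for the SOP Class in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalField:
    modality: str   # "All" or a modality code such as "CT"
    tag: str        # DICOM (group,element) tag
    attribute: str  # Attribute name
    type: str       # "1", "1C", "2" or "2C"


# Illustrative rows only; not a copy of the Research Critical Fields list.
FIELDS = [
    CriticalField("All", "(0008,0018)", "SOP Instance UID", "1"),
    CriticalField("All", "(0010,0010)", "Patient's Name", "2"),
    CriticalField("All", "(0008,0020)", "Study Date", "2"),
    CriticalField("All", "(0028,0006)", "Planar Configuration", "1C"),
    CriticalField("CT", "(0018,0060)", "KVP", "2"),
    CriticalField("MR", "(0018,0020)", "Scanning Sequence", "1"),
]


def insertion_candidates(modality):
    """Return Type 1 and Type 2 fields that apply to the given modality."""
    return [
        field
        for field in FIELDS
        if field.modality in ("All", modality) and field.type in ("1", "2")
    ]


ct_names = [field.attribute for field in insertion_candidates("CT")]
```

Conditional (1C and 2C) rows are kept in the list but excluded by the filter, mirroring the decision not to modify them.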

Table 5 General and modality specific data Attributes and Types as specified in the DICOM standard.

Adoption of TCIA Curation as the best practice

There is no clear definition of “important attributes” for secondary research in the research community. Many publications mention important DICOM attributes, but they were related more to the authors’ own research programs than a community-based consensus. Since TCIA is one of the most mature DICOM imaging archives, we adopted the TCIA curation process7, as illustrated in Fig. 2, and resultant dataset as the best practice on this issue.

Fig. 2

Schematic description of the standard TCIA Curation Workflow based on the Posda tool suite.

Creation of the evaluation dataset

To create the evaluation dataset, we deployed a process to re-identify DICOM images. For each image that was downloaded from TCIA for a specific patient (by Patient ID / Series ID / Study ID), we overwrote selected DICOM Attribute values with synthetic data. This repopulation of Attribute values was accomplished using version 0.7.5 of Posda7 (https://code.imphub.org/projects/PT/repos/oneposda), the open source package used for curation by TCIA. We created a file specifying the scope (Collection, Patient, Study, Series, Instance) as well as the operations to be performed, which are listed in Table 6. This file was then used by Posda to bulk edit the selected Attributes. For burned-in annotations (text within the pixel data), we extended these editing parameters to include both the text to be inserted and the coordinates of the location of the PHI on the image. Posda used the open source software package ImageMagick (https://imagemagick.org/index.php, version 7.0.9-7) to insert multiple lines of text into the Pixel Data.
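The idea of a scoped bulk edit can be sketched as follows. This is not Posda's file format or interface: the dictionary layout, the operation names `set` and `append`, and the function names are assumptions chosen for illustration, and image instances are represented as plain dictionaries rather than DICOM files.

```python
# Identifier used to match an instance at each scope level (assumed layout).
SCOPE_KEYS = {
    "Patient": "PatientID",
    "Study": "StudyInstanceUID",
    "Series": "SeriesInstanceUID",
    "Instance": "SOPInstanceUID",
}


def in_scope(instance, scope, scope_id):
    """Return True if the instance falls within the requested scope."""
    if scope == "Collection":
        return True
    return instance[SCOPE_KEYS[scope]] == scope_id


def apply_edits(instances, edits):
    """Apply each edit to every instance inside its scope (edits in place)."""
    for edit in edits:
        for instance in instances:
            if not in_scope(instance, edit["scope"], edit["id"]):
                continue
            attrs = instance["attrs"]
            name = edit["attribute"]
            if edit["op"] == "set":        # hypothetical operation name
                attrs[name] = edit["value"]
            elif edit["op"] == "append":   # hypothetical operation name
                attrs[name] = (attrs.get(name, "") + " " + edit["value"]).strip()
            else:
                raise ValueError(f"unknown operation: {edit['op']}")
    return instances
```

Expressing the scope separately from the operation lets one line of the specification touch a single instance or an entire collection without changing the editing logic.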

Table 6 Re-Identification Operations.

De-identified evaluation dataset

To create an example of how the evaluation dataset would look once re-de-identified using tools and procedures equivalent to those in current use by TCIA, a TCIA curation team that had no knowledge of the evaluation dataset creation process was tasked with the creation of a de-identified version of the evaluation dataset. This de-identified evaluation dataset follows the standards outlined above as the best practice for de-identification.

MIDI project dataset

The Medical Imaging De-Identification Initiative (MIDI), sponsored by the National Cancer Institute, produced a significantly larger evaluation dataset. After the creation of the full set, 21 records were split off to create the publishable evaluation dataset which is made available on TCIA and described in this publication. Please also note that we are unable to release some elements of the MIDI project due to the need to protect the integrity of the full dataset, which remains the property of the National Cancer Institute.

Data Records

MIDI-Evaluation collection

The evaluation dataset (containing synthetic PHI) and TCIA de-identified evaluation dataset (curated by TCIA) along with crosswalks for both patient IDs and DICOM UIDs between the two datasets have been published11. They may be accessed via the referenced DOI or via the TCIA collection browser as collection Pseudo-PHI-DICOM-Data (https://www.cancerimagingarchive.net/collections/).

Technical Validation

To validate the resulting curated datasets, an answer key was created to compare tag states between the pre- and post-curation datasets. An example of the answer key can be seen in Table 7. The answer key is driven by the actions listed in Table 8 along with action text (list of text retained or removed, etc.) for the various comparisons needed for evaluation. We wrote a Python evaluation script for comparing an answer key to a de-identified dataset. The inputs to the evaluation script are the answer key files along with a Patient ID Crosswalk containing a cross-reference between the old Patient ID and the new Patient ID and a UID Crosswalk for old to new UIDs, which are used for comparison per SOP class included in the collection.
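The comparison logic can be sketched in a few lines. This is not the evaluation script itself: the row layout, the action names `text_removed` and `text_retained`, and the representation of the de-identified dataset as a dictionary keyed by new SOP Instance UID are assumptions for illustration and do not reproduce Tables 7 and 8.

```python
def evaluate(answer_key, deidentified, pid_crosswalk, uid_crosswalk):
    """Compare a de-identified dataset to an answer key.

    Returns a list of (row, reason) tuples, one per discrepancy.
    """
    failures = []
    for row in answer_key:
        # Locate the curated instance via the old-to-new UID crosswalk.
        attrs = deidentified.get(uid_crosswalk.get(row["sop_uid"]))
        if attrs is None:
            failures.append((row, "instance missing"))
            continue
        # The Patient ID should match the one recorded in the crosswalk.
        if attrs.get("PatientID") != pid_crosswalk.get(row["patient_id"]):
            failures.append((row, "unexpected Patient ID"))
        value = attrs.get(row["attribute"], "")
        if row["action"] == "text_removed" and row["text"] in value:
            failures.append((row, "PHI text still present"))
        elif row["action"] == "text_retained" and row["text"] not in value:
            failures.append((row, "required text lost"))
    return failures
```

Checking both removal and retention captures the two ways a de-identification process can fail: leaving PHI in place, or discarding information that should have been kept.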

When the TCIA curation team completed their curation task of generating the de-identified evaluation dataset, we compared that dataset to the answer key, and only expected discrepancies (e.g., new UID and Patient ID mapping) were found.

Table 7 Answer key format.
Table 8 Answer Key actions.