Abstract
We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM objects (a total of 1,693 CT, MRI, PET, and digital X-ray images) were selected from datasets published in the Cancer Imaging Archive (TCIA). Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM Attributes to mimic typical clinical imaging exams. The DICOM Standard and TCIA curation audit logs guided the insertion of synthetic PHI into standard and non-standard DICOM data elements. A TCIA curation team tested the utility of the evaluation dataset. With this publication, the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (the result of TCIA curation) are released on TCIA in advance of a competition, sponsored by the National Cancer Institute (NCI), for algorithmic de-identification of medical image datasets. The competition will use a much larger evaluation dataset constructed in the same manner. This paper describes the creation of the evaluation datasets and guidelines for their use.
Measurement(s): Deidentification • Clinical Data
Technology Type(s): data synthesis • digital curation
Factor Type(s): imaging type
Sample Characteristic - Organism: Homo sapiens
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14802774
Background & Summary
Open access or shared research data must comply with the Health Insurance Portability and Accountability Act (HIPAA) regulations that govern patient privacy. These regulations require the de-identification or removal of protected health information (PHI) and other personally identifiable information (PII) from datasets before they can be made publicly available. The Cancer Imaging Archive (TCIA)1 of the National Cancer Institute (NCI) is one of the largest and most trusted public archives of de-identified cancer images. Over the years, TCIA has developed image de-identification tools and protocols that combine automated and manual de-identification processes. This approach has proven effective for the de-identification of DICOM radiology imaging and digital pathology whole-slide imaging (WSI) submitted to TCIA.
The process of image de-identification and curation is time consuming, requires significant resources, and is prone to human fatigue and error. Automated image de-identification algorithms require evaluation before they can be deployed to process data for open access. This evaluation requires a robust dataset that can be used as a part of assessing image de-identification algorithms. We set out to develop a de-identification evaluation dataset to address that need. Because TCIA is one of the most mature imaging archives with an established and effective image de-identification method, we adopted the TCIA curation process as the current best practice in de-identification. Using TCIA and a newly developed toolset, we created an evaluation dataset by inserting synthetic PHI into already de-identified data.
While it is common to assume de-identification and anonymization are synonymous, in this document we follow Kushida et al.2,3 who make a clear distinction between these concepts: “De-identification of medical record data refers to the removal or replacement of personal identifiers so that it would be difficult to re-establish a link between the individual and his or her data. Anonymization refers to the irreversible removal of the link between the individual and his or her medical record data to the degree that it would be virtually impossible to reestablish the link.” Throughout this document, we will only deal with de-identification.
The evaluation dataset described in this data descriptor is a subset of a larger evaluation dataset created under contract for the National Cancer Institute. We published this subset on TCIA and describe it here so that researchers can test their de-identification algorithms and to promote standardized procedures for validating automated de-identification.
Methods
The full process of generating the evaluation dataset and de-identified evaluation dataset, which serves as an example result of applying a complete de-identification process to the evaluation dataset, is summarized in Fig. 1. Note that in this document, the terms “subject” and “patient” are used as synonyms.
Images selected from TCIA
To build the evaluation dataset, we selected imaging studies from TCIA to represent a broad cross-section of the current TCIA public collections. Table 1 breaks down the content of the evaluation set into the total number of patients, studies, series, and images per modality, the anatomy imaged by modality, and the manufacturers of the imaging equipment used to collect the data. No images of heads were included to avoid subjects being identified by facial features4,5. The evaluation set contains 1,693 images from 21 patients, 22 studies, and 26 series, for a total of 609 MB of data.
Implants
A handful of images containing medical implants were visually inspected for PHI by a trained member of TCIA’s curation team. It is important to visually inspect implant devices because they could contain a serial number that could be used to identify the patient6. If PHI is found, it should be removed or obscured in the image; if that is not possible, the image should not be published. In our selected images, we did not see any information that would warrant alteration or removal of images. Users of this dataset could be instructed to obscure the model numbers as a test of this capability. Normally no such modification would be required, because model numbers are generally not traceable to an individual and so do not constitute PHI.
DICOM Standard and Manufacturer’s Private Attributes Using Audit Logs
TCIA audit logs are updated whenever curators make any adjustments to DICOM information objects (including image headers) to remove potential PHI. These audit logs represent the complete provenance of the changes made to transform the submitted data into the published information objects7. The logs contain the before/after/replaced values of all DICOM standard Attributes and manufacturer’s Private Attributes8.
When DICOM data are submitted to TCIA, Private Attributes are de-identified according to the DICOM Retain Safe Private Option9 that allows for the retention of data stored in Private Attributes that do not hold PHI. Retention decisions are based on the extensive Private Attribute dictionary maintained by TCIA, which contains all the Private Attributes ever submitted to TCIA8. The dictionary also contains the process operation description (POD) used to modify the data in the Private Attribute to accomplish de-identification. The PODs are: (1) kept, (2) hashed, (3) off-set, (4) deleted, or (5) emptied. The choice of which POD to employ in a given instance is based on the Attribute Type and definition, e.g., DICOM unique identifiers (UIDs) are hashed, dates are off-set.
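As a rough illustration of how the five process operations listed above might act on attribute values, the following Python sketch applies each one to a plain dictionary standing in for an image header. It is not Posda code. The function names, the salt, the choice of a SHA-256 digest placed under the `2.25` UID root, and the 30-day offset are all assumptions made for the example.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical per-project values; real ones would be chosen by the curators.
SALT = "example-salt"
DATE_OFFSET = timedelta(days=-30)


def hash_uid(uid: str) -> str:
    """Map a UID to a new, deterministic UID under the 2.25 root (assumed scheme)."""
    digest = hashlib.sha256((SALT + uid).encode("utf-8")).hexdigest()
    # 30 hex digits give at most 37 decimal digits, well inside the 64-character UID limit.
    return "2.25." + str(int(digest[:30], 16))


def offset_date(value: str) -> str:
    """Shift a DICOM date value (YYYYMMDD) by a fixed number of days."""
    shifted = datetime.strptime(value, "%Y%m%d") + DATE_OFFSET
    return shifted.strftime("%Y%m%d")


def apply_pod(attributes: dict, keyword: str, pod: str) -> None:
    """Apply one process operation to one attribute, editing the dict in place."""
    if pod == "kept":
        return
    if pod == "deleted":
        attributes.pop(keyword, None)
    elif pod == "emptied":
        attributes[keyword] = ""
    elif pod == "hashed":
        attributes[keyword] = hash_uid(attributes[keyword])
    elif pod == "off-set":
        attributes[keyword] = offset_date(attributes[keyword])
    else:
        raise ValueError(f"unknown process operation: {pod}")


# Made-up header values and a made-up operation plan.
header = {
    "StudyInstanceUID": "1.2.840.99999.1",
    "StudyDate": "20200115",
    "PatientAddress": "123 Example St",
    "AccessionNumber": "A12345",
    "Modality": "CT",
}
plan = {
    "StudyInstanceUID": "hashed",
    "StudyDate": "off-set",
    "PatientAddress": "deleted",
    "AccessionNumber": "emptied",
    "Modality": "kept",
}
for keyword, pod in plan.items():
    apply_pod(header, keyword, pod)
```

Hashing and offsetting are deterministic here, so the same input UID or date maps to the same output across all images in a study, which keeps references between objects consistent.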
We stratified the coded data from the audit log by a combination of variables, including whether or not the DICOM Attribute is standard or private, DICOM Attribute description, and the TCIA process operation. A Pareto analysis10 was performed to determine the vital few data element/operation combinations that occur with the greatest frequency. Subsets of the results of this analysis can be found in Tables 2 and 3.
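The stratification and Pareto step can be sketched in a few lines of Python. The sketch below counts how often each (standard or private, Attribute description, process operation) combination occurs and returns the most frequent combinations up to a cumulative cutoff. The 80% cutoff, the function name, and the sample rows are assumptions for illustration; the actual cutoff and audit-log contents are not given here.

```python
from collections import Counter


def pareto_vital_few(records, threshold=0.8):
    """Return the most frequent strata that together reach `threshold` of all records.

    Each record is a hashable stratum, here a tuple of
    (standard_or_private, attribute_description, process_operation).
    The result lists (stratum, count, cumulative_fraction) in descending frequency.
    """
    counts = Counter(records)
    total = sum(counts.values())
    vital_few = []
    cumulative = 0
    for stratum, count in counts.most_common():
        cumulative += count
        vital_few.append((stratum, count, cumulative / total))
        if cumulative / total >= threshold:
            break
    return vital_few


# Made-up audit-log rows, not real TCIA data.
log = (
    [("standard", "Patient's Name", "emptied")] * 6
    + [("standard", "Study Date", "off-set")] * 3
    + [("private", "Unknown private element", "deleted")] * 1
)
vital_few = pareto_vital_few(log)
```

With these sample rows, the first two combinations account for nine of the ten records, so the third is left out of the vital few.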
Table 2 lists examples of standard DICOM Attributes, and Table 3 shows examples of Private Attributes. Both tables list the Data Element tags (group and element number combination from the DICOM data dictionary) and the frequency counts of each. Note that a listed data field does not necessarily signify that PHI was seen during the de-identification process, only that the potential for PHI existed and actions were taken to ensure that no PHI made it through the curation process.
Generation of synthetic data
Synthetic PHI data elements were generated using the Python package Faker (https://pypi.org/project/Faker, version 4.1.2). In addition to data elements one might expect to contain PHI, e.g., Patient Name and Address, we identified common Attributes, such as Study Description, which could potentially contain useful information while also containing PHI. These Attributes were selected for potential synthetic PHI insertion to demonstrate that deleting or emptying Attributes indiscriminately is not always the best solution; rather, the information in the Attribute needs to be modified to retain important information while removing PHI.
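To make the Study Description case concrete, here is a minimal Python sketch of generating a synthetic patient and embedding the name in a free-text Attribute that also carries useful content. To stay self-contained it uses the standard-library `random` module with tiny hard-coded name lists as a stand-in for Faker. The function names, name lists, and identifier format are invented for the example.

```python
import random

# Tiny stand-in vocabularies; Faker provides far richer generators.
FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley"]
LAST_NAMES = ["Doe", "Roe", "Poe", "Moe"]


def synthetic_patient(rng: random.Random) -> dict:
    """Build a fake patient record from a seeded random generator."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        # DICOM person names separate family and given name with '^'.
        "PatientName": f"{last}^{first}",
        # Hypothetical 7-digit identifier format.
        "PatientID": f"{rng.randrange(10**7):07d}",
    }


def add_phi_to_description(description: str, patient: dict) -> str:
    """Append synthetic PHI to a free-text value without losing its original content."""
    family, given = patient["PatientName"].split("^")
    return f"{description} for {given} {family}"


rng = random.Random(42)  # seeded so the synthetic values are reproducible
patient = synthetic_patient(rng)
study_description = add_phi_to_description("CT CHEST W CONTRAST", patient)
```

A de-identification algorithm that simply empties Study Description would remove the name but also lose "CT CHEST W CONTRAST". The preferred outcome is to strip only the appended name.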
Selecting research critical fields and adherence to DICOM standard
In the DICOM standard, each Attribute is assigned a Type that specifies whether the Attribute is mandatory, optional, or conditional. The Attribute Type may be dependent on the modality of the image. The five Attribute Types are shown in Table 4.
We focused only on Attributes that were Type 1 (Attribute required, valid value required) and Type 2 (Attribute required, value may be null). Type 1C and 2C Attributes are conditional and require a determination of whether the conditions have been met that dictate whether the Data Element is Type 1 or 2. Therefore, no Type 1C or 2C Attributes were modified with synthetic PHI, although we retained Type 1C and 2C Attributes in the image headers under the assumption that they were properly de-identified during initial TCIA curation. Note also that Attribute Types vary depending on the Service Object Pair (SOP) Class (modality), so we took this into account when generating our list of required Attributes.
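The selection rule can be expressed as a simple filter. The Python sketch below keeps only Type 1 and Type 2 entries that apply to a given modality. The row layout and function name are assumptions. The listed Types are illustrative and should be checked against the DICOM Standard for the SOP Class in question.

```python
# Illustrative rows only; verify Types against the Standard for each SOP Class.
FIELDS = [
    {"modality": "All", "tag": "(0008,0018)", "attribute": "SOP Instance UID", "type": "1"},
    {"modality": "All", "tag": "(0010,0010)", "attribute": "Patient's Name", "type": "2"},
    {"modality": "All", "tag": "(0008,0005)", "attribute": "Specific Character Set", "type": "1C"},
    {"modality": "All", "tag": "(0020,0060)", "attribute": "Laterality", "type": "2C"},
    {"modality": "CT", "tag": "(0018,0060)", "attribute": "KVP", "type": "2"},
]

# Conditional types (1C, 2C) are deliberately excluded.
ELIGIBLE_TYPES = {"1", "2"}


def type_1_or_2_fields(fields, modality):
    """Return the rows that are Type 1 or 2 and apply to `modality`."""
    return [
        row
        for row in fields
        if row["type"] in ELIGIBLE_TYPES and row["modality"] in ("All", modality)
    ]


ct_fields = [row["attribute"] for row in type_1_or_2_fields(FIELDS, "CT")]
mr_fields = [row["attribute"] for row in type_1_or_2_fields(FIELDS, "MR")]
```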
Table 5 shows a subset of the full list of Research Critical Fields we generated, showing the requirements for various DICOM Attributes for different modalities and the types and descriptions of each. The modality column signifies how the Attributes are treated based on modality. For fields where this entry is “All”, the type applies to all modalities. The tag column provides the DICOM group and element tag for the data element that encodes the Attribute, the Attribute column contains the name of the Attribute, the desc column provides a description and conditional requirements, and the Type column identifies the Attribute Type (1, 1C, 2, or 2C) as shown in Table 4.
Adoption of TCIA Curation as the best practice
There is no clear definition of “important attributes” for secondary research in the research community. Many publications mention important DICOM attributes, but they were related more to the authors’ own research programs than a community-based consensus. Since TCIA is one of the most mature DICOM imaging archives, we adopted the TCIA curation process7, as illustrated in Fig. 2, and resultant dataset as the best practice on this issue.
Creation of the evaluation dataset
To create the evaluation dataset, we deployed a process to re-identify DICOM images. For each image that was downloaded from TCIA for a specific patient (by Patient ID / Series ID / Study ID), we overwrote selected DICOM Attribute values with synthetic data. This repopulation of Attribute values was accomplished using version 0.7.5 of Posda7 (https://code.imphub.org/projects/PT/repos/oneposda), the open source package used for curation by TCIA. We created a file specifying the scope (Collection, Patient, Study, Series, Instance) as well as the operations to be performed, which are listed in Table 6. This file was then used by Posda to bulk edit the selected Attributes. For burn-in annotations (text within the pixel data), we extended these editing parameters to include both the text to be inserted and the coordinates of the location of the PHI on the image. Posda used the open source software package ImageMagick (https://imagemagick.org/index.php, version 7.0.9-7) to insert multiple lines of text into the Pixel Data.
De-identified evaluation dataset
To create an example of how the evaluation dataset would look once re-de-identified using tools and procedures equivalent to those in current use by TCIA, a TCIA curation team that had no knowledge of the evaluation dataset creation process was tasked with the creation of a de-identified version of the evaluation dataset. This de-identified evaluation dataset follows the standards outlined above as the best practice for de-identification.
MIDI project dataset
The Medical Imaging De-Identification Initiative (MIDI), sponsored by the National Cancer Institute, produced a significantly larger evaluation dataset. After the creation of the full set, 21 records were split off to create the publishable evaluation dataset which is made available on TCIA and described in this publication. Please also note that we are unable to release some elements of the MIDI project due to the need to protect the integrity of the full dataset, which remains the property of the National Cancer Institute.
Data Records
MIDI-Evaluation collection
The evaluation dataset (containing synthetic PHI) and TCIA de-identified evaluation dataset (curated by TCIA) along with crosswalks for both patient IDs and DICOM UIDs between the two datasets have been published11. They may be accessed via the referenced DOI or via the TCIA collection browser as collection Pseudo-PHI-DICOM-Data (https://www.cancerimagingarchive.net/collections/).
Technical Validation
To validate resultant curated datasets, an answer key was created to compare tag states between pre- and post-curated datasets. An example of the answer key can be seen in Table 7. The answer key is driven by the actions listed in Table 8 along with action text (list of text retained or removed, etc.) for the various comparisons needed for evaluation. We wrote a Python evaluation script for comparing an answer key to a de-identified dataset. The inputs to the evaluation script are the answer key files, a Patient ID Crosswalk containing a cross-reference between the old Patient ID and the new Patient ID, and a UID Crosswalk for old to new UIDs, which are used for comparison per SOP class included in the collection.
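The comparison described above can be sketched as follows. This is not the evaluation script used for the dataset. The action names (`tag_removed`, `text_removed`, `text_retained`, `patient_id_mapped`), the answer-key row layout, and the in-memory dictionaries standing in for DICOM files are hypothetical, chosen only to show how an answer key and the two crosswalks might be combined.

```python
def evaluate(answer_key, deidentified, patient_crosswalk, uid_crosswalk):
    """Compare a de-identified dataset to an answer key.

    answer_key:        list of dicts with keys sop_uid, keyword, action, action_text
    deidentified:      {new SOP Instance UID: {attribute keyword: value}}
    patient_crosswalk: {old Patient ID: new Patient ID}
    uid_crosswalk:     {old UID: new UID}
    Returns a list of (old_sop_uid, keyword, problem) discrepancies.
    """
    problems = []
    for entry in answer_key:
        old_uid, keyword = entry["sop_uid"], entry["keyword"]
        header = deidentified.get(uid_crosswalk.get(old_uid))
        if header is None:
            problems.append((old_uid, keyword, "instance not found"))
            continue
        value = header.get(keyword)
        action, text = entry["action"], entry["action_text"]
        if action == "tag_removed":
            if value is not None:
                problems.append((old_uid, keyword, "tag still present"))
        elif action == "text_removed":
            if value is not None and text in value:
                problems.append((old_uid, keyword, "PHI text still present"))
        elif action == "text_retained":
            if value is None or text not in value:
                problems.append((old_uid, keyword, "expected text missing"))
        elif action == "patient_id_mapped":
            if value != patient_crosswalk.get(text):
                problems.append((old_uid, keyword, "Patient ID not mapped"))
        else:
            problems.append((old_uid, keyword, f"unknown action: {action}"))
    return problems


# Made-up inputs for a single image.
uid_crosswalk = {"1.2.3": "2.25.111"}
patient_crosswalk = {"OLD-001": "NEW-001"}
answer_key = [
    {"sop_uid": "1.2.3", "keyword": "StudyDescription", "action": "text_removed", "action_text": "Jordan Doe"},
    {"sop_uid": "1.2.3", "keyword": "StudyDescription", "action": "text_retained", "action_text": "CT CHEST"},
    {"sop_uid": "1.2.3", "keyword": "PatientID", "action": "patient_id_mapped", "action_text": "OLD-001"},
    {"sop_uid": "1.2.3", "keyword": "PatientAddress", "action": "tag_removed", "action_text": ""},
]
good = {"2.25.111": {"StudyDescription": "CT CHEST W CONTRAST", "PatientID": "NEW-001"}}
bad = {
    "2.25.111": {
        "StudyDescription": "CT CHEST for Jordan Doe",
        "PatientID": "OLD-001",
        "PatientAddress": "1 Main St",
    }
}
good_problems = evaluate(answer_key, good, patient_crosswalk, uid_crosswalk)
bad_problems = evaluate(answer_key, bad, patient_crosswalk, uid_crosswalk)
```

Checking both "text removed" and "text retained" on the same Attribute is what separates a targeted edit from blanket deletion: an emptied Study Description would pass the first check and fail the second.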
When the TCIA curation team completed their curation task of generating the de-identified evaluation dataset, we compared that dataset to the answer key, and only expected discrepancies (e.g., new UID and Patient ID mapping) were found.
Code availability
Synthetic Protected Health Information (PHI) was generated using the Faker software package (https://pypi.org/project/Faker) and inserted into selected DICOM Attributes using an extended version of the Posda7 tool suite (https://code.imphub.org/projects/PT/repos/oneposda), the open source package used for curation and de-identification by TCIA. Posda incorporated the open source software package ImageMagick (https://imagemagick.org/index.php) to insert multiple lines of text into Pixel Data.
References
Clark, K. et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 26, 1045–1057, https://doi.org/10.1007/s10278-013-9622-7 (2013).
Kushida, C. A. et al. Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Med Care 50, S82–101, https://doi.org/10.1097/mlr.0b013e3182585355 (2012).
Chevrier, R., Foufi, V., Gaudet-Blavignac, C., Robert, A. & Lovis, C. Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review. J Med Internet Res 21, e13484, https://doi.org/10.2196/13484 (2019).
Prior, F. W. et al. Facial recognition from volume-rendered magnetic resonance imaging data. IEEE T. Inf. Technol. B. 13, 5–9 (2008).
Schwarz, C. G. et al. Identification of anonymous MRI research participants with face-recognition software. N. Engl. J. Med. 381, 1684–1686 (2019).
Robinson, J. D. Beyond the DICOM header: additional issues in deidentification. Am J Roentgenol. 203, W658–W664 (2014).
Bennett, W., Smith, K., Jarosz, Q., Nolan, T. & Bosch, W. Reengineering workflow for curation of DICOM datasets. J. Digit. Imaging. 31, 783–791 (2018).
Moore, S. M. et al. De-identification of Medical Images with Retention of Scientific Research Value. RadioGraphics 35, 727–735, https://doi.org/10.1148/rg.2015140244 (2015).
DICOM. In PS3.15 2016a - Security and System Management Profiles (NEMA, Rosslyn, VA, 2016).
Tanabe, K. Pareto’s 80/20 rule and the Gaussian distribution. Physica A: Statistical Mechanics and its Applications 510, 635–640, https://doi.org/10.1016/j.physa.2018.07.023 (2018).
Rutherford, M. et al. Dataset from Medical Imaging De-Identification Initiative (MIDI). The Cancer Imaging Archive https://doi.org/10.7937/s17z-r072 (2021).
Acknowledgements
This project has been funded in whole or in part with federal funds from the National Cancer Institute, Contract No. 75N91019D00024, Subcontract 20X023F. Funding for Posda development is provided by U24CA215109. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
Author information
Contributions
All authors reviewed and contributed to the manuscript. Tarbox originated the concept of an evaluation dataset and process. Farahani conceived the MIDI project, and Wagner served as project manager. Freymann is the overall technical project manager for TCIA. Rutherford extracted and aggregated TCIA audit logs, built image corpus, generated synthetic data and files for re-identification, and generated answer keys and evaluation process and script. Mun and Prior provided domain expertise and coordinated the creation and editing of the manuscript as well as co-directing the MIDI project. Bennett created and performed the “re-identification” process using Posda. Farmer performed statistical analysis on TCIA audit logs and generated frequency tables. Jarosz created the ability to add text into images as a new process in Posda. Smith directed the TCIA curation process, which was carried out by Blake, and he and Levine provided TCIA curation expertise.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.
About this article
Cite this article
Rutherford, M., Mun, S.K., Levine, B. et al. A DICOM dataset for evaluation of medical image de-identification. Sci Data 8, 183 (2021). https://doi.org/10.1038/s41597-021-00967-y
DOI: https://doi.org/10.1038/s41597-021-00967-y