PediCXR: An open, large-scale chest radiograph dataset for interpretation of common thoracic diseases in children

Computer-aided diagnosis systems in adult chest radiography (CXR) have recently achieved great success thanks to the availability of large-scale, annotated datasets and the advent of high-performance supervised learning algorithms. However, the development of diagnostic models for detecting and diagnosing pediatric diseases in CXR scans is undertaken due to the lack of high-quality physician-annotated datasets. To overcome this challenge, we introduce and release PediCXR, a new pediatric CXR dataset of 9,125 studies retrospectively collected from a major pediatric hospital in Vietnam between 2020 and 2021. Each scan was manually annotated by a pediatric radiologist with more than ten years of experience. The dataset was labeled for the presence of 36 critical findings and 15 diseases. In particular, each abnormal finding was identified via a rectangle bounding box on the image. To the best of our knowledge, this is the first and largest pediatric CXR dataset containing lesion-level annotations and image-level labels for the detection of multiple findings and diseases. For algorithm development, the dataset was divided into a training set of 7,728 and a test set of 1,397. To encourage new advances in pediatric CXR interpretation using data-driven approaches, we provide a detailed description of the PediCXR data sample and make the dataset publicly available on https://physionet.org/content/vindr-pcxr/1.0.0/.


Background & Summary
Common thoracic diseases cause several hundred thousand deaths every year among children under five years old 1,2 .The chest radiograph or CXR is the first-line and most commonly performed imaging examination in the assessment of the pediatric patient 3 .Interpreting CXR scans on pediatric patients can be for a number of indications or critical findings, in particular for common thoracic diseases in children such as Pneumonia, Bronchitis and Cardiovascular diseases (CVDs).Depending on the patients' age, the difficulty of the examination will vary, often requiring a specialist in pediatric diagnostic imaging with an in-depth knowledge of radiological signs of different lung conditions 4 .Additionally, the inter-observer agreement and intra-observer agreement in the pediatric CXR interpretation were low 5 .This opens room for the development of data-driven approaches and computational tools to assist pediatricians in the diagnosis of common thoracic diseases and to reduce their workload.
Computer-aided diagnosis (CAD) systems for identification of lung abnormality in adult CXRs have recently achieved great success thanks to the availability of large labeled datasets [6][7][8][9][10] .Many large-scale CXR datasets of adult patients such as Montgomery County chest X-ray (MC) 11 , Shenzhen chest X-ray 11 , ChestX-ray8 6 , COVIDGR 12 , ChestX-ray14 6 , Padchest 7 , CheXpert 8 , MIMIC-CXR 9 and VinDr-CXR 10 have been established and released in recent years.These datasets boosted new advances in exploring new machine learning-based approaches in the interpretation of CXR in adults 8,[13][14][15][16][17][18] .Unfortunately, the creation of pediatric CXR datasets is still unexploited, and the number of benchmark pediatric CXR datasets is limited.This becomes the main obstacle in developing and transferring new machine learning-based CAD systems for pediatric CXR in clinical practice.
In an effort to provide a large-scale pediatric CXR dataset with high-quality annotations for the research community, we have built the PediCXR dataset in DICOM format.The dataset consists of 9,125 posteroanterior (PA) view CXR scans in patients younger than 10 years that were retrospectively collected from three major hospitals in Vietnam from 2020 to 2021.
In particular, all CXR scans come with both the localization of critical findings and the classification of common thoracic diseases.These images were annotated by a group of three radiologists with at least 10 years of experience for the presence of 36 critical findings (local labels) and 15 diagnoses (global labels).Here, the local labels should be annotated with rectangle bounding boxes that localize the findings, while the global labels reflect the diagnostic impression of the radiologist at the image-level.For algorithm development, we randomly divided the dataset into two parts: the training set of 7,728 scans (84.7%) and the test set of 1,397 scans (15.3%).To the best of our knowledge, the released PediCXR is currently the largest public pediatric CXR dataset with radiologist-generated annotations in both training and test sets.Table 1 below shows an overview of existing public datasets for CXR interpretation in pediatric patients, compared with the PediCXR.Compared to the previous works, the PediCXR dataset shows two main advantages.First, the dataset is labeled for multiple findings and diseases.Meanwhile, most pediatric CXR datasets have focused on a single disease such as pneumonia 19 or pneumothorax 20 .Second, the dataset provides bounding box annotations at lesion level, which is useful for developing explainable artificial intelligent models 21 for the CXR interpretation in children.We believe the introduction of the PediCXR provides a suitable imaging source for investigating the ability of supervised machine learning models in identifying common lung diseases in pediatric patients.

Data collection
Data collection was conducted at the Phu Tho Obstetric & Pediatric Hospital (PTOPH) between 2020 -2021.The ethical clearance of this study was approved by the Institutional Review Boards (IRBs) of the PTOPH.The need for obtaining informed patient consent was waived because this retrospective study did not impact clinical care or workflow at these two hospitals, and all patient-identifiable information in the data has been removed.We retrospectively collected more than 10,000 CXRs in DICOM format from a local picture archiving and communication system (PACS) at PTOPH.The imaging dataset was then transferred and analyzed at Smart Health Center, VinBigData JSC.

Overview of approach
The building of the PediCXR dataset is illustrated in Figure 1.In particular, the collection and normalization of the dataset were divided into four main steps: (1) data collection, (2) data de-identification, (3) data filtering, and (4) data labeling.We describe each step in detail as below.

Data de-identification
In this study, we follow the HIPAA Privacy Rule 22 to protect individually identifiable health information from the DICOM images.To this end, we removed or replaced with random values all personally identifiable information associated with the images via a two-stage de-identification process.At the first stage, a Python script was used to remove all DICOM tags of protected health information (PHI) 23 such as patient's name, patient's date of birth, patient ID, or acquisition time and date, etc.For the purpose of loading and processing DICOM files, we only retained a limited number of DICOM attributes that are necessary, as indicated in Table 2 (Supplementary materials).In the second stage, we manually removed all textual information appearing on the image data, i.e., pixel annotations that could include patient's identifiable information.

Data filtering
The collected raw data included a significant amount of outliers including CXRs of adult patients, body parts other than chest (abdominal, spine, and others), low-quality images, or lateral CXRs.To filter a large number of CXR scans, we trained a lightweight convolutional neural network (CNN) 24 to remove all outliers automatically.Next, a manual verification was performed to ensure all outliers had been fully removed.

Data labeling
The PediCXR dataset was labeled for a total of 36 findings and 15 diagnoses.These labels were divided into two categories: local labels (#1 -#36) and global labels (#37 -#52).The local labels should be marked with bounding boxes that localize the findings, while the global labels should reflect the diagnostic impression of the radiologist.This list of labels was suggested by a committee of the most experienced pediatric radiologists.To select these labels, the committee took into account two key factors.First, findings and diseases are prevalent.Second, they can be differentiated on pediatric chest X-ray scans.Figure 2 illustrates several samples with both local and global labels annotated by our radiologists.
To facilitate the labeling process, we designed and built a web-based framework called VinDr Lab 25 that allows a team of experienced radiologists remotely annotate the data.Specifically, this is a web-based labeling tool that was developed to store, manage, and remotely annotate DICOM data.The radiologists were oriented to locate the abnormal findings from the DICOM viewer and draw the bounding boxes.All the annotators have been well-trained to ensure that the annotations are consistently annotated.In addition, all the radiologists participating in the labeling process were certified in diagnostic radiology and received healthcare professional certificates.In total, three pediatric radiologists with at least 15 years of experience were involved in the annotation process.Each sample in the training set was assigned to one radiologist for annotation.Additionally, all of the participating radiologists were blinded to relevant clinical information.A set of 9,125 pediatric CXRs were randomly annotated from the filtered data, of which 7,728 scans serve as the training set, and the remaining 1,397 studies form the test set.Note the 9,125 studies correspond to 9,125 patients, and each study has a single CXR scan.Once the labeling was completed, the annotations of all pediatric CXRs were exported in JavaScript Object Notation (JSON) format.We developed a Python script to parse JSON files and organized the annotations in the form of a single comma-separated values (CSV) file.Each CSV file contains labels, bounding box coordinates, and their corresponding image identifiers (IDs).The data characteristics, including patient demographic and the prevalence of each finding or disease, are summarized in Table 3.The distributions of abnormal findings and pathologies in the training set are drawn in Figure 3 and Figure 4, respectively.

Data Records
The PediCXR dataset will be made available for public download on PhysioNet ? .We offer complete imaging data as well as ground truth labels for both the training and test datasets.The pediatric scans were split into two folders: one for training and one for testing, named as "train" and "test", respectively.Since each study has only one instance and each patient has maximum one study, therefore, the value of the SOP Instance UID provided by the DICOM tag (0008, 0018) was encoded into a unique, anonymous identifier for each image.To this end, we used the Python hashlib module (see Code Availability) to encode the SOP Instance UIDs into image IDs.The radiologists' local annotations of the training set were provided in a CSV file called annotations_train.csv.Each row of the CSV file represents a bounding box annotation with the following attributes: image ID (image_id), radiologist ID (rad_id), label's name (class_name), bounding box 6/9 coordinates (x_min, y_min, x_max, y_max), and label class ID (class_id).The coordinates of the box's upper-left corner are (x_min, y_min), and the coordinates of the box's lower right corner are (x_max, y_max).Meanwhile, the image-level labels of the training set were stored in a different CSV file called image_labels_train.csv, with the following fields: Image ID (image_id), radiologist ID (rad_ID), and labels (labels) for both the findings and diagnoses.Each image ID is associated with a vector of multiple labels corresponding to different pathologies, with positive pathologies encoded as "1" and negative pathologies encoded as "0".Similarly, the test set's bounding-box annotations and image-level labels were saved in the files annotations_test.csvand image_labels_test.csv,respectively.

Technical Validation
The data de-identification process was controlled.Specifically, all DICOM meta-data was parsed and manually reviewed to ensure that all individually identifiable health information (PHI) 23 of the children patients has been removed to meet the U.S. HIPAA 22 regulations.In addition, pixel values of all pediatric CXR scans were also carefully examined by human readers.During this review process, all scans were manually reviewed case-by-case by a team of 10 human readers.A small number of images containing private textual information that had not been removed by our algorithm was excluded from the dataset.The manual review process also helped identify and discard out-of-distribution samples such as CXRs of adult patients, body parts other than the chest, low-quality images, or lateral CXRs that our machine learning classifier was not able to detect.A set of rules underlying our web-based annotation tool were developed to control the quality of the labeling process.These rules prevent human annotators from mechanical mistakes like forgetting to choose global labels or marking lesions on the image while choosing "No finding" as the global label.

Usage Notes
The PediCXR dataset was established for the purpose of developing and evaluating machine learning algorithms for detecting and localizing anomalies in pediatric CXR images.The dataset has been previously used in a study on the diagnosis of multiple diseases in pediatric patients 26 and showed promising results.Specifically, the authors 26 introduced a deep learning network to detect common pulmonary pathologies on CXR of pediatric patients.On the test set of 777 studies of the PediCXR dataset, the network yielded an area under the receiver operating characteristic (AUC) of 0.709 (95% CI, 0.690-0.729).The sensitivity, specificity, and F1-score at the cutoff value are 0.722 (0.694-0.750), 0.579 (0.563-0.595), and 0.389 (0.373-0.405), respectively.However, they recognized that its performance remains low compared to medical experts.This work revealed the major challenge in learning disease features on pediatric CXR images using representation learning techniques, opening huge aspects for future research.
The primary uses for which the PediCXR dataset was conceptualized include: • Developing and validating a predictive model for the classification of common thoracic diseases in pediatric patients.
• Developing and validating a predictive model for the localization of multiple abnormal findings on the pediatric chest X-ray scans.
Finally, the released dataset remains with limitations that still need to be addressed in the future, including: • The dataset did not contain clinical information associated with DICOM images, which is essential for the interpretation of CXR in children patients.
To download and use the PediCXR , users are required to accept the PhysioNet Credentialed Health Data License 1.5.0.By accepting this license, users agree that they will not share access to the dataset with anyone else.For any publication that explores this resource, the authors must cite this original paper and release their code and models.

Figure 1 .
Figure 1.Construction of the PediCXR dataset.First, raw pediatric scans in DICOM format were collected retrospectively from the hospital's PACS at PTOPH.These images were de-identified to protect patient's privacy.Then, invalid files (including adult CXR images, images of other modalities or other body parts, images with low quality, or incorrect orientation) were manually filtered out.After that, a web-based DICOM labeling tool called VinDr Lab was developed to remotely annotate DICOM data.Finally, the annotated dataset was then divided into a training set (N = 7,728) and a test set (N = 1,397 ) for algorithm development.

Figure 2 .
Figure 2. Several examples of pediatric CXR images with radiologist's annotations.Local labels marked by radiologists are plotted on the original images for visualization purposes.These annotations show abnormal findings from the scans.The global labels, that classify images into diseases, are in bold and listed at the bottom of each example.

Figure 3 .
Figure 3. Distribution of abnormal findings on the training set of PediCXR.Rare findings (less than 10 examples) are not included.

Figure 4 .
Figure 4. Distribution of pathologies on the training set of PediCXR.Rare diseases (less than 10 examples) are not included.

Table 1 .
An overview of existing public datasets for CXR interpretation in pediatric patients.

Table 2 .
The list of DICOM tags that were retained for loading and processing raw images.All other tags were removed for protecting patient privacy.Details about all these tags can be found from DICOM Standard Browser at https://dicom.innolitics.com/ciods.Ratio of the vertical size and horizontal size of the pixels in the image specified by a pair of integer values where the first value is the vertical pixel size, and the second value is the horizontal pixel size.