Background & Summary

Common thoracic diseases cause several hundred thousand deaths every year among children under five years old1,2. The chest radiograph (CXR) is the first-line and most commonly performed imaging examination in the assessment of pediatric patients3. Pediatric CXRs are interpreted for a range of indications and critical findings, in particular for common thoracic diseases in children such as pneumonia, bronchitis, and cardiovascular diseases (CVDs). The difficulty of the examination varies with the patient’s age, and interpretation often requires a specialist in pediatric diagnostic imaging with in-depth knowledge of the radiological signs of different lung conditions4. In addition, inter-observer and intra-observer agreement in pediatric CXR interpretation has been reported to be low5. This leaves room for data-driven approaches and computational tools that assist pediatricians in diagnosing common thoracic diseases and reduce their workload.

Computer-aided diagnosis (CAD) systems for identifying lung abnormalities in adult CXRs have recently achieved great success thanks to the availability of large labeled datasets6,7,8,9,10. Many large-scale CXR datasets of adult patients, such as Montgomery County chest X-ray (MC)11, Shenzhen chest X-ray11, ChestX-ray86, COVIDGR12, ChestX-ray146, Padchest7, CheXpert8, MIMIC-CXR9 and VinDr-CXR10, have been established and released in recent years. These datasets have driven advances in machine learning-based approaches to CXR interpretation in adults8,13,14,15,16,17,18. By contrast, the creation of pediatric CXR datasets remains underexplored, and the number of benchmark pediatric CXR datasets is limited. This is the main obstacle to developing machine learning-based CAD systems for pediatric CXR and transferring them into clinical practice.

In an effort to provide a large-scale pediatric CXR dataset with high-quality annotations for the research community, we have built the PediCXR dataset in DICOM format. The dataset consists of 9,125 posteroanterior (PA) view CXR scans of patients younger than 10 years that were retrospectively collected from a major hospital in Vietnam from 2020 to 2021. All CXR scans come with both the localization of critical findings and the classification of common thoracic diseases. The images were annotated by a group of three radiologists with at least 10 years of experience for the presence of 36 critical findings (local labels) and 15 diagnoses (global labels). Local labels are annotated with rectangular bounding boxes that localize the findings, while global labels reflect the radiologist’s diagnostic impression at the image level. For algorithm development, we randomly divided the dataset into two parts: a training set of 7,728 scans (84.7%) and a test set of 1,397 scans (15.3%). To the best of our knowledge, PediCXR is currently the largest public pediatric CXR dataset with radiologist-generated annotations in both the training and test sets. Table 1 gives an overview of existing public datasets for CXR interpretation in pediatric patients, compared with PediCXR. Compared with previous work, the PediCXR dataset has two main advantages. First, the dataset is labeled for multiple findings and diseases, whereas most pediatric CXR datasets have focused on a single disease such as pneumonia19 or pneumothorax20. Second, the dataset provides bounding box annotations at the lesion level, which is useful for developing explainable artificial intelligence models21 for CXR interpretation in children. We believe that PediCXR provides a suitable imaging source for investigating the ability of supervised machine learning models to identify common lung diseases in pediatric patients.

Table 1 An overview of existing public datasets for CXR interpretation in pediatric patients.


Data collection

Data collection was conducted at the Phu Tho Obstetric & Pediatric Hospital (PTOPH) between 2020 and 2021. The study received ethical approval from the Institutional Review Board (IRB) of PTOPH. The need to obtain informed patient consent was waived because this retrospective study did not impact clinical care or workflow at the hospital, and all patient-identifiable information in the data has been removed. We retrospectively collected more than 10,000 CXRs in DICOM format from the local picture archiving and communication system (PACS) at PTOPH. The imaging dataset was then transferred to and analyzed at Smart Health Center, VinBigData JSC.

Overview of approach

The construction of the PediCXR dataset is illustrated in Fig. 1. The collection and normalization of the dataset were divided into four main steps: (1) data collection, (2) data de-identification, (3) data filtering, and (4) data labeling. We describe each step in detail below.

Fig. 1

Construction of the PediCXR dataset. First, raw pediatric scans in DICOM format were collected retrospectively from the hospital’s PACS at PTOPH. These images were de-identified to protect patient privacy. Then, invalid files (including adult CXR images, images of other modalities or other body parts, low-quality images, and images with incorrect orientation) were filtered out. After that, a web-based DICOM labeling tool called VinDr Lab was developed and used to annotate the DICOM data remotely. Finally, the annotated dataset was divided into a training set (N = 7,728) and a test set (N = 1,397) for algorithm development.

Data de-identification

In this study, we followed the HIPAA Privacy Rule22 to protect individually identifiable health information in the DICOM images. To this end, we removed, or replaced with random values, all personally identifiable information associated with the images via a two-stage de-identification process. In the first stage, a Python script was used to remove all DICOM tags containing protected health information (PHI)23, such as the patient’s name, date of birth, patient ID, and acquisition date and time. For loading and processing the DICOM files, we retained only the small set of DICOM attributes that are necessary, as listed in Table 2 (Supplementary materials). In the second stage, we manually removed all textual information appearing in the image data, i.e., pixel annotations that could include patient-identifiable information.
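
The first, tag-level stage can be pictured as a whitelist filter. The sketch below is illustrative only and is not the authors' script. It assumes the DICOM header has already been read into a plain dictionary keyed by (group, element) tag, and the retained tags in `RETAINED_TAGS` are hypothetical stand-ins, since the actual list is the one given in Table 2. The function name `strip_header` is likewise an assumption.

```python
# Hypothetical whitelist; the attributes actually retained are listed in Table 2.
RETAINED_TAGS = {
    (0x0028, 0x0004),  # Photometric Interpretation
    (0x0028, 0x0010),  # Rows
    (0x0028, 0x0011),  # Columns
    (0x0028, 0x0100),  # Bits Allocated
    (0x7FE0, 0x0010),  # Pixel Data
}


def strip_header(header):
    """Return a copy of `header` that keeps only whitelisted tags."""
    return {tag: value for tag, value in header.items() if tag in RETAINED_TAGS}


# Made-up header values for illustration.
raw_header = {
    (0x0010, 0x0010): "DOE^JANE",  # Patient's Name (PHI)
    (0x0010, 0x0020): "123456",    # Patient ID (PHI)
    (0x0010, 0x0030): "20180314",  # Patient's Birth Date (PHI)
    (0x0028, 0x0010): 2048,        # Rows
    (0x0028, 0x0011): 2048,        # Columns
}
clean_header = strip_header(raw_header)
```

Keeping an explicit allow-list, rather than deleting a list of known PHI tags, means that any unexpected or private tag is dropped by default.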

Table 2 The list of DICOM tags that were retained for loading and processing raw images.

Data filtering

The collected raw data included a substantial number of outliers, including CXRs of adult patients, images of body parts other than the chest (abdomen, spine, and others), low-quality images, and lateral CXRs. To filter this large number of CXR scans, we trained a lightweight convolutional neural network (CNN)24 to remove outliers automatically. Manual verification was then performed to ensure that all outliers had been removed.

Data labeling

The PediCXR dataset was labeled for a total of 36 findings and 15 diagnoses. These labels were divided into two categories: local labels (#1– #36) and global labels (#37– #52). Local labels are marked with bounding boxes that localize the findings, while global labels reflect the diagnostic impression of the radiologist. The list of labels was proposed by a committee of the most experienced pediatric radiologists. In selecting the labels, the committee took two key factors into account. First, the findings and diseases should be prevalent. Second, they should be distinguishable on pediatric chest X-ray scans. Figure 2 illustrates several samples with both local and global labels annotated by our radiologists.

Fig. 2

Several examples of pediatric CXR images with radiologists’ annotations. Local labels marked by radiologists are plotted on the original images for visualization purposes. These annotations show abnormal findings in the scans. The global labels, which classify images into diseases, are in bold and listed at the bottom of each example.

To facilitate the labeling process, we designed and built a web-based framework called VinDr Lab25 that allows a team of experienced radiologists to store, manage, and remotely annotate DICOM data. The radiologists were instructed to locate abnormal findings in the DICOM viewer and draw bounding boxes around them. All annotators were trained to ensure that the annotations are consistent. In addition, all radiologists participating in the labeling process were certified in diagnostic radiology and held healthcare professional certificates. In total, three pediatric radiologists with at least 10 years of experience were involved in the annotation process. Each sample in the training set was assigned to one radiologist for annotation. All participating radiologists were blinded to relevant clinical information. A set of 9,125 pediatric CXRs was randomly selected from the filtered data and annotated, of which 7,728 scans serve as the training set and the remaining 1,397 scans form the test set. Note that the 9,125 studies correspond to 9,125 patients, and each study has a single CXR scan.
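
A random split with the reported sizes can be sketched as follows. This is a minimal illustration, not the procedure actually used: the seed, the placeholder image IDs, and the helper name `split_ids` are all assumptions.

```python
import random

N_TOTAL, N_TRAIN = 9125, 7728  # sizes reported in the text


def split_ids(image_ids, n_train, seed=0):
    """Shuffle the IDs reproducibly and cut them into train/test lists."""
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)  # seed value is an assumption
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs; the released dataset uses hashed image identifiers.
all_ids = [f"img_{i:05d}" for i in range(N_TOTAL)]
train_ids, test_ids = split_ids(all_ids, N_TRAIN)

print(len(train_ids), len(test_ids))             # 7728 1397
print(round(100 * len(train_ids) / N_TOTAL, 1))  # 84.7
print(round(100 * len(test_ids) / N_TOTAL, 1))   # 15.3
```

Because each patient contributes exactly one study with one scan, splitting at the image level is equivalent to splitting at the patient level, so no patient appears in both sets.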

Once the labeling was completed, the annotations of all pediatric CXRs were exported in JavaScript Object Notation (JSON) format. We developed a Python script to parse the JSON files and organize the annotations into comma-separated values (CSV) files. Each CSV file contains labels, bounding box coordinates, and the corresponding image identifiers (IDs). The data characteristics, including patient demographics and the prevalence of each finding or disease, are summarized in Table 3. The distributions of abnormal findings and pathologies in the training set are shown in Figs. 3 and 4, respectively.
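
The JSON-to-CSV step might look like the sketch below. The structure of the exported JSON is not described in the text, so the layout of `EXPORT_JSON` (one object per image with a `boxes` list) is purely hypothetical, as are the sample values and the function name `export_to_csv`. Only the output column names follow the released annotation files.

```python
import csv
import io
import json

# Hypothetical export layout: one JSON object per image with a list of boxes.
EXPORT_JSON = """
[
  {"image_id": "a1b2", "rad_id": "R1", "boxes": [
      {"class_name": "Consolidation", "class_id": 3,
       "x_min": 100, "y_min": 150, "x_max": 400, "y_max": 520}
  ]},
  {"image_id": "c3d4", "rad_id": "R2", "boxes": []}
]
"""

FIELDS = ["image_id", "rad_id", "class_name",
          "x_min", "y_min", "x_max", "y_max", "class_id"]


def export_to_csv(json_text):
    """Flatten the per-image JSON into one CSV row per bounding box."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for image in json.loads(json_text):
        for box in image["boxes"]:
            writer.writerow({"image_id": image["image_id"],
                             "rad_id": image["rad_id"], **box})
    return out.getvalue()


csv_text = export_to_csv(EXPORT_JSON)
```

In this sketch an image without boxes produces no row; how such images are represented in the released files is not something the sketch tries to reproduce.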

Table 3 Dataset characteristics of PediCXR.
Fig. 3

Distribution of abnormal findings in the training set of PediCXR. Rare findings (fewer than 10 examples) are not included.

Fig. 4

Distribution of pathologies in the training set of PediCXR. Rare diseases (fewer than 10 examples) are not included.

Data Records

The PediCXR dataset will be made available for public download on PhysioNet26. We provide the complete imaging data as well as ground truth labels for both the training and test sets. The pediatric scans are split into two folders, named “train” and “test”, for training and testing, respectively. Since each study has only one instance and each patient has at most one study, the value of the SOP Instance UID provided by the DICOM tag (0008,0018) was encoded into a unique, anonymous identifier for each image. To this end, we used the Python hashlib module (see Code Availability) to encode the SOP Instance UIDs into image IDs. The radiologists’ local annotations of the training set are provided in a CSV file called annotations_train.csv. Each row of the CSV file represents a bounding box annotation with the following attributes: image ID (image_id), radiologist ID (rad_id), label name (class_name), bounding box coordinates (x_min, y_min, x_max, y_max), and label class ID (class_id). The coordinates of the box’s upper-left corner are (x_min, y_min), and the coordinates of the box’s lower-right corner are (x_max, y_max). The image-level labels of the training set are stored in a separate CSV file called image_labels_train.csv, with the following fields: image ID (image_id), radiologist ID (rad_id), and labels (labels) for both the findings and diagnoses. Each image ID is associated with a vector of multiple labels corresponding to different pathologies, with positive pathologies encoded as “1” and negative pathologies encoded as “0”. Similarly, the test set’s bounding-box annotations and image-level labels are saved in the files annotations_test.csv and image_labels_test.csv, respectively.
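
As a rough illustration of how these records fit together, the sketch below hashes a SOP Instance UID into an image ID and groups bounding boxes by image. The text states only that hashlib was used, so the choice of MD5 here is an assumption, and the UID, the rows, the class IDs, and the helper names `uid_to_image_id` and `boxes_by_image` are made up for the example. Only the column names are taken from the description above.

```python
import csv
import hashlib
import io
from collections import defaultdict


def uid_to_image_id(sop_instance_uid):
    """Map a SOP Instance UID to an anonymous hex ID (MD5 is an assumption)."""
    return hashlib.md5(sop_instance_uid.encode("utf-8")).hexdigest()


def boxes_by_image(csv_file):
    """Group bounding boxes from an annotations CSV by image_id."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if not row["x_min"]:  # skip rows that carry no box coordinates
            continue
        box = tuple(float(row[k]) for k in ("x_min", "y_min", "x_max", "y_max"))
        grouped[row["image_id"]].append((row["class_name"], box))
    return dict(grouped)


# Made-up UID and rows, laid out like annotations_train.csv.
image_id = uid_to_image_id("1.2.840.0000.1.2.3")
sample = io.StringIO(
    "image_id,rad_id,class_name,x_min,y_min,x_max,y_max,class_id\n"
    f"{image_id},R1,Consolidation,100,150,400,520,3\n"
    f"{image_id},R1,Pleural thickening,50,60,90,200,7\n"
)
boxes = boxes_by_image(sample)
```

With the real files, `boxes_by_image(open("annotations_train.csv", newline=""))` would be the analogous call, and the file stem of each DICOM image would be expected to match an `image_id` key.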

Technical Validation

The data de-identification process was carefully controlled. Specifically, all DICOM metadata was parsed and manually reviewed to ensure that all protected health information (PHI)23 of the pediatric patients had been removed in accordance with the U.S. HIPAA22 regulations. In addition, the pixel data of all pediatric CXR scans were examined by human readers. During this review, all scans were inspected case by case by a team of 10 human readers. A small number of images containing private textual information that had not been removed by our algorithm were excluded from the dataset. The manual review also helped identify and discard out-of-distribution samples, such as CXRs of adult patients, images of body parts other than the chest, low-quality images, and lateral CXRs, that our machine learning classifier had not detected. A set of rules built into our web-based annotation tool was developed to control the quality of the labeling process. These rules prevent annotators from making mechanical mistakes, such as forgetting to choose global labels or marking lesions on the image while choosing “No finding” as the global label.
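
The two annotation rules mentioned above can be expressed as a small consistency check. This is a sketch of the idea rather than the logic implemented in VinDr Lab; the function name `check_annotation` and the data layout are assumptions.

```python
def check_annotation(global_labels, boxes):
    """Return a list of rule violations for one annotated image.

    global_labels: set of image-level label names chosen by the annotator.
    boxes: list of local findings drawn on the image.
    """
    problems = []
    if not global_labels:
        problems.append("no global label selected")
    if "No finding" in global_labels and boxes:
        problems.append('lesions marked although global label is "No finding"')
    return problems


# Made-up examples: a missing global label, a contradiction, and a valid case.
missing = check_annotation(set(), [])
contradiction = check_annotation({"No finding"}, [("Consolidation", (1, 2, 3, 4))])
valid = check_annotation({"Pneumonia"}, [("Consolidation", (1, 2, 3, 4))])
```

An annotation tool would typically refuse to save a case until such a check returns an empty list.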

Usage Notes

The PediCXR dataset was established for the purpose of developing and evaluating machine learning algorithms for detecting and localizing anomalies in pediatric CXR images. The dataset has previously been used in a study on the diagnosis of multiple diseases in pediatric patients27, which showed promising results. Specifically, the authors27 introduced a deep learning network to detect common pulmonary pathologies on pediatric CXRs. On a test set of 777 studies from the PediCXR dataset, the network yielded an area under the receiver operating characteristic curve (AUC) of 0.709 (95% CI, 0.690–0.729). The sensitivity, specificity, and F1-score at the cutoff value were 0.722 (0.694–0.750), 0.579 (0.563–0.595), and 0.389 (0.373–0.405), respectively. However, the authors acknowledged that this performance remains low compared with that of medical experts. That work revealed the major challenge of learning disease features on pediatric CXR images with representation learning techniques, leaving many directions open for future research.
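
For readers who want to compute the same kinds of metrics on their own predictions, the sketch below evaluates a single binary label. It is not the evaluation code of the cited study; how that study chose its cutoff or aggregated over diseases is not described here. The scores, labels, and threshold are toy values, and the function names are assumptions. The sketch also assumes both classes are present and that at least one case is predicted positive.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(s >= threshold and y == 1 for s, y in pairs)
    fp = sum(s >= threshold and y == 0 for s, y in pairs)
    tn = sum(s < threshold and y == 0 for s, y in pairs)
    fn = sum(s < threshold and y == 1 for s, y in pairs)
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and F1-score from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions for one disease label.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
sens, spec, f1 = binary_metrics(*confusion(scores, labels, threshold=0.5))
area = auc(scores, labels)
```

The pairwise AUC is quadratic in the number of cases, which is fine for a test set of this size but would be replaced by a rank-based computation for much larger ones.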

The primary uses for which the PediCXR dataset was conceptualized include:

  • Developing and validating a predictive model for the classification of common thoracic diseases in pediatric patients.

  • Developing and validating a predictive model for the localization of multiple abnormal findings on the pediatric chest X-ray scans.

Finally, the released dataset has limitations that remain to be addressed in the future, including:

  • The dataset does not contain the clinical information associated with the DICOM images, which is essential for the interpretation of CXR in pediatric patients.

  • The number of examples for rare diseases (e.g., Congenital pulmonary airway malformation (CPAM), Congenital emphysema, Diaphragmatic hernia, Mediastinal tumor, Pleuro-pneumonia, Situs inversus, Lung tumor) or findings (Emphysema, Edema, Calcification, Chest wall mass, Bronchiectasis, Pleural thickening, Clavicle fracture, Pleuropulmonary mass, Paravertebral mass, etc.) is limited. Hence, supervised learning algorithms, which require large-scale annotated data, cannot be trained reliably on the PediCXR dataset to diagnose these rare diseases and findings.
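
Before training, users may want to check which classes are too sparse to learn from. The sketch below counts, for each class, the number of distinct images in which it appears and flags those below a threshold. The rows are made up, the helper name `rare_classes` is an assumption, and counting per image rather than per box is a choice made for this example. The threshold of 10 mirrors the cutoff used in the captions of Figs. 3 and 4.

```python
from collections import defaultdict

RARE_THRESHOLD = 10  # same cutoff as the figure captions


def rare_classes(rows, threshold=RARE_THRESHOLD):
    """Return {class_name: n_images} for classes seen in fewer than `threshold` images."""
    images = defaultdict(set)
    for image_id, class_name in rows:
        images[class_name].add(image_id)
    return {name: len(ids) for name, ids in images.items() if len(ids) < threshold}


# Made-up (image_id, class_name) pairs, e.g. taken from an annotations CSV.
rows = [(f"img{i}", "Consolidation") for i in range(12)]
rows += [("img0", "Edema"), ("img0", "Edema"), ("img1", "Edema")]
rare = rare_classes(rows)
```

Classes flagged this way could be excluded from training or merged into a broader category, depending on the application.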

To download and use PediCXR, users are required to accept the Credentialed Health Data License 1.5.0. By accepting this license, users agree that they will not share access to the dataset with anyone else. For any publication that uses this resource, the authors must cite this original paper and release their code and models.