POLCOVID: a multicenter multiclass chest X-ray database (Poland, 2020–2021)

The outbreak of the SARS-CoV-2 pandemic pushed healthcare systems worldwide to their limits, resulting in increased waiting times for diagnosis and required medical assistance. With chest radiographs (CXR) being one of the most common COVID-19 diagnosis methods, many artificial intelligence tools for image-based COVID-19 detection have been developed, often trained on a small number of images from COVID-19-positive patients. Thus, the need for high-quality and well-annotated CXR image databases has increased. This paper introduces the POLCOVID dataset, containing chest X-ray (CXR) images of patients with COVID-19 or other-type pneumonia and of healthy individuals, gathered from 15 Polish hospitals. The original radiographs are accompanied by preprocessed images limited to the lung area and the corresponding lung masks obtained with a segmentation model. Moreover, manually created lung masks are provided for a part of the POLCOVID dataset and for four other publicly available CXR image collections. The POLCOVID dataset can help in pneumonia or COVID-19 diagnosis, while the set of matched images and lung masks may serve the development of lung segmentation solutions.


Background & Summary
The outbreak of the SARS-CoV-2 pandemic in 2020 made healthcare systems worldwide face new challenges. Limited testing capacity, especially in the early phases of the pandemic, shortages of adequate equipment, and overloaded hospitals were the main factors hindering sufficient patient diagnosis and management 1,2. Hence, chest radiography became a crucial diagnostic tool, especially for individuals experiencing dyspnea 3,4. Also, patients requiring rapid treatment and support in the form of oxygenation or ventilation were often unable to wait for the RT-PCR test result. The COVID-19 pandemic and the challenges it caused led to the development of many Artificial Intelligence (AI)-based tools for COVID-19 detection 5,6,7. Consequently, alongside the advantages of AI-assisted diagnosis, a great need arose for reliable, high-quality, and universal imaging datasets 8.
Here, we provide two datasets used for different purposes in our studies. The first dataset was created for COVID-19 detection and includes a set of 4809 chest X-ray (CXR) images collected from COVID-19 positive and negative patients in 15 Polish hospitals. Medical doctors labelled all CXR images based on diagnosis as COVID-19 (n=1236), other-type pneumonia (n=1147), or healthy, normal lungs (n=2426). Some radiographs were also annotated with demographic information such as age, sex, and smoking history. Among the records with known sex, the cohort is sufficiently balanced (1415 males, 1243 females), and it is heterogeneous in terms of age, ranging from 0 to 99 years. As the medical centers that provided the data are located in various regions of Poland, the study population can be considered geographically representative. As an extension to the original CXR images, we deliver their preprocessed versions limited to the lung area and the corresponding lung masks generated by our lung segmentation model. We also provide the disease subtype prediction for each patient, which explains the heterogeneity within each group.
The second dataset served to build the lung segmentation model. It contains lung masks manually created by experts for 6297 chest images, including 4003 from Polish hospitals. For those, we deliver the corresponding original CXRs. The rest of the chest images came from publicly available sources; therefore, we only provide their masks.
The POLCOVID dataset can serve for the generation of novel pneumonia and/or COVID-19 screening or diagnosis tools, while the set of matched images and lung masks may support the development of lung segmentation solutions.

Ethical statement
The project was approved by the committees of all collaborating medical centers. Patients provided informed consent to participate in the study. We removed all identifiable patient information. We complied with all relevant ethical regulations and guidelines.

Data source
Fifteen medical centers from seven regions of Poland participated in the data acquisition. At each hospital, patients were diagnosed with COVID-19 or other types of pneumonia based on radiological findings or labeled as normal otherwise. COVID-19 was confirmed radiologically in all COVID-19 positive cases. This diagnosis was additionally supported by an RT-PCR test. All COVID-19 positive patients required medical assistance, although they might have developed various symptoms. The centers uploaded the data between August 7th, 2020, and April 7th, 2021. Hence, no patients infected with the Omicron SARS-CoV-2 variant participated in the study, as the first reports of this variant appeared in November 2021 9. The summary of the number of CXR images provided by each medical center is presented in Table 1.

Imaging
CXR images were collected using various devices and parameters due to differences in equipment between medical centers. All radiographs were performed in a frontal projection.

Data collection
We created a web service dedicated to the medical centers participating in the project so that they could provide the data in a secure manner. Registered users from the POLCOVID Study Group uploaded radiographs annotated with a medical diagnosis. When available, medical centers attached a more detailed patient description including demographic and clinical information such as sex, age, and smoking history. X-ray images were stored in the Digital Imaging and Communications in Medicine (DICOM) 10 or JPEG formats, depending on the uploader. Exemplary CXR images representing COVID-19, pneumonia, and normal patients are presented in Fig. 1a.

Data preparation
We applied the U-Net neural network to segment the lung area from the standardized and contrast-enhanced CXR images 11. During the model generation, the sigmoid (for the last convolutional layer) and the Scaled Exponential Linear Unit (SELU) (for all remaining layers) served as activation functions, the Sørensen-Dice coefficient (SDC) as a similarity measure for the loss function, and the adaptive learning rate method ADAM 15 as the optimization algorithm. With the model-generated masks, we limited the standardized image to the lung area, the region of interest (ROI), which was then resized to 512×512 pixels while preserving the original aspect ratio. Prazuch et al. 16 describe the lung segmentation procedure in detail. For all the POLCOVID CXRs, we deliver resized ROI images and model-generated lung masks adjusted to the ROI dimensions. Exemplary ROI images and lung masks representing COVID-19, pneumonia, and normal patients are presented in Fig. 1b, c. As a separate data subset, we also provide all manually annotated lung masks and the original POLCOVID CXRs used to generate the lung segmentation model.
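The Sørensen-Dice coefficient mentioned above is commonly defined as twice the overlap of two masks divided by the sum of their sizes. The following is a minimal sketch of that definition and of a loss derived from it, assuming flattened masks with values in [0, 1]. The function names and the smoothing constant `eps` are illustrative assumptions and are not taken from the POLCOVID code.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Soft Sorensen-Dice coefficient between two flattened masks.

    `pred` holds predicted probabilities in [0, 1] and `target` holds
    reference labels (0 or 1). `eps` is an assumed smoothing constant
    that avoids division by zero when both masks are empty.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(pred, target, eps=1e-7):
    """Loss that decreases as the overlap between the masks grows."""
    return 1.0 - dice_coefficient(pred, target, eps)


# Toy example with four pixels (values are made up for illustration).
predicted = [0.9, 0.8, 0.1, 0.0]
reference = [1, 1, 0, 0]
print(round(dice_coefficient(predicted, reference), 4))
print(round(dice_loss(predicted, reference), 4))
```

In a training setup, the same expression would be evaluated on tensors by the deep learning framework; the pure-Python version only illustrates the arithmetic.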

Demographic summary
The patient sex is well-balanced in the normal (554 males, 583 females) and COVID-19 (492 males, 437 females) groups and in the whole cohort (1415 males, 1243 females). In the pneumonia group, male patients are overrepresented (369 males, 223 females). The summary of sex and smoking status in total and by diagnosis is presented in Table 3. Proportions of sexes in diagnosis groups and in the whole cohort are presented in Fig. 2a.
Table 3. The summary of sex and smoking status in the whole cohort and in diagnosis groups.
The dataset is highly heterogeneous in terms of patient age, ranging from 0 to 99 years, with a mean and median equal to 60.24 and 63 years, respectively. Age distributions differ significantly between the patient groups (Kruskal-Wallis one-way analysis of variance p-value < 10^-6). The median age of COVID-19 and pneumonia patients is equal (67 years) with a similar range. In the normal group, the median age is lower and the range is smaller than in the other groups. Games-Howell post-hoc tests showed significant differences in age distribution only between the normal group and each of the remaining two (both p-values < 10^-6). For the COVID-19 versus pneumonia comparison, the p-value equalled 0.9. The summary of age and pack-years in total and by diagnosis is presented in Table 4. Age distributions in diagnosis groups and in the whole cohort are presented in Fig. 2b.
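To illustrate how a Kruskal-Wallis comparison of three groups works, the sketch below computes the tie-corrected H statistic and its chi-square approximated p-value. It relies on the fact that, for exactly three groups, the reference distribution has two degrees of freedom, so the p-value reduces to exp(-H/2). The age values are made up for the example and are not taken from the dataset; in practice a statistics library would normally be used instead.

```python
import math
from collections import Counter


def kruskal_wallis_3(group_a, group_b, group_c):
    """Kruskal-Wallis H test for exactly three groups.

    Returns (H, p). The p-value uses the chi-square approximation with
    2 degrees of freedom, whose survival function is exp(-H / 2).
    """
    groups = (group_a, group_b, group_c)
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j

    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_term - 3.0 * (n + 1)

    # Standard correction for tied observations.
    ties = Counter(pooled)
    correction = 1.0 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction
    return h, math.exp(-h / 2.0)


# Hypothetical ages for three diagnosis groups (illustration only).
normal_ages = [34, 41, 52, 47, 29]
pneumonia_ages = [66, 71, 58, 69, 75]
covid_ages = [64, 70, 67, 73, 61]
h_stat, p_value = kruskal_wallis_3(normal_ages, pneumonia_ages, covid_ages)
print(h_stat, p_value)
```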

Disease subtype prediction
We used the nUMAP method from Suwalska et al. 17 to predict the disease subtype, as described in Prazuch et al. 16. For this step, we extended our POLCOVID dataset with two publicly available CXR databases: COVIDx 18 (n=15403) and AIforCovid 19 (n=1105). The nUMAP approach involves a neural network serving as a feature extractor. It takes CXR images with clinical information as an input and provides a numerical data matrix with feature values per image as an output of the final fully connected layer. We applied the standard UMAP algorithm with the cosine distance metric on the numerical feature vectors to visualize the data in two-dimensional space. This projection served for fitting the two-dimensional Gaussian mixture model (2D GMM) with the modified expectation-maximization (EM) algorithm, as explained in Marczyk 20. We obtained three mixture model components per diagnosis category (COVID-19, pneumonia, and normal), each representing a different disease subtype. The first subtypes correspond to the typical representatives of each group (denoted as C1, P1, and N1, respectively). The second subtypes contain mild cases (C2, P2, and N2, respectively). The third subtypes (C3, P3, N3) show the smallest differences between the groups and represent the atypical cases. The results of 2D GMM fitting to the nUMAP embedding are shown in Fig. 3.
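Once a 2D Gaussian mixture has been fitted, a point of the embedding can be assigned to the component with the highest posterior probability. The sketch below shows only this assignment step with the textbook bivariate normal density; it does not reproduce the modified EM algorithm from Marczyk 20, and the component weights, means, and covariances are hypothetical values chosen for illustration.

```python
import math


def gaussian_2d_pdf(point, mean, cov):
    """Density of a bivariate normal distribution at `point`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def responsibilities(point, components):
    """Posterior probability of each mixture component for `point`."""
    weighted = {
        name: weight * gaussian_2d_pdf(point, mean, cov)
        for name, (weight, mean, cov) in components.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


def assign_subtype(point, components):
    """Return the label of the component with the highest posterior."""
    posterior = responsibilities(point, components)
    return max(posterior, key=posterior.get)


# Hypothetical mixture for one diagnosis group: (weight, mean, covariance).
covid_components = {
    "C1": (0.5, (0.0, 0.0), ((1.0, 0.2), (0.2, 1.0))),
    "C2": (0.3, (4.0, 1.0), ((1.5, 0.0), (0.0, 0.5))),
    "C3": (0.2, (-3.0, 3.0), ((0.8, -0.1), (-0.1, 0.8))),
}
print(assign_subtype((3.5, 0.8), covid_components))
```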

POLCOVID image collection
All CXR images are available in de-identified form from the CIRCA COVID-19 CXR/CT-based diagnosis web service 21. Access to the database is controlled. The user is required to register and provide their name, institution, e-mail address, and the purpose of data usage. Once registered, the user may download the radiographs (in the DICOM or JPEG format), preprocessed images, and lung masks of a group of interest.

POLCOVID metadata structure
The metadata files are available for registered users from the CIRCA COVID-19 CXR/CT-based diagnosis web service 21 in the form of a Microsoft Excel spreadsheet covering all files and separate CSV files for each group. They contain patient demographic and clinical data, group and subtype labels, information regarding the hospital of data collection, and the image quality category. Table 5 defines the variables included in the metadata.

quality: Image quality category: "Good" - sufficient quality, "Bad" - insufficient quality. The criteria for quality assessment are described in the Technical Validation section.
set: Set to which the image was assigned in Prazuch et al. 16: "train" - training set, "hold-out test" - testing set.
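As an illustration of how the `quality` and `set` variables might be used, the sketch below filters metadata rows to good-quality training images. Only the `quality` and `set` column names and their values come from the description above; the `file_name` and `group` columns, the group label spellings, and the sample rows are hypothetical, and the real metadata files may be laid out differently.

```python
import csv
import io

# Hypothetical excerpt of a metadata CSV (rows invented for illustration).
sample_csv = """file_name,group,quality,set
img_001.png,COVID-19,Good,train
img_002.png,normal,Bad,train
img_003.png,pneumonia,Good,hold-out test
"""


def select_images(csv_text, quality="Good", subset="train"):
    """Return metadata rows matching the requested quality and set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row["quality"] == quality and row["set"] == subset
    ]


for row in select_images(sample_csv):
    print(row["file_name"], row["group"])
```

With a downloaded file, the same function could be applied to the file contents read via `open(path, newline="")`.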

Lung segmentation image collection
We provide the manually created lung masks in the PNG format for all CXR images used to generate the lung segmentation model (radiographs delivered by the POLCOVID Study Group and collected from the publicly available databases). Moreover, the original de-identified versions of the POLCOVID CXR images used for the manual mask annotation are also available. Registered users may download the unprocessed POLCOVID images as TIFF files and all lung masks in the PNG format, separately for each data source, from the CIRCA COVID-19 CXR/CT-based diagnosis web service 21.

Lung segmentation metadata
The metadata for radiographs used to create the lung segmentation model are available in a CSV file for registered users from the CIRCA COVID-19 CXR/CT-based diagnosis web service 21. They contain information regarding the data source, the file names consistent with those used by data providers, and the subset to which we assigned an image in the model generation process. Table 6 defines the variables included in the metadata.

Anonymization
We carefully de-identified all radiographs. We deleted all identifiable metadata stored in DICOM objects and manually reviewed all image data. All personal information on radiographs was also removed.

Data quality control
We curated the database based on the DICOM headers when available. We visually inspected every X-ray image and removed all radiographs that had lateral projections, showed incomplete lung regions, or were improperly saved or stored. We reviewed the clinical data for consistency and filled in the missing demographic fields if an uploader provided the lacking information elsewhere. Moreover, we further investigated the image quality. We identified very low-resolution images, characterized by a lung area smaller than 300 pixels in height or width. We also identified the radiographs whose quality prevents proper lung segmentation, leaving one or both lungs mostly or entirely undetected. We characterized the segmentation quality by a score defined as the mean value of four lung mask properties: eccentricity, orientation, area, and solidity, as explained in Prazuch et al. 16. The lung segmentation quality score was normalized to range from 0 to 1. We identified poor-quality images with outlying quality scores using an outlier detection method dedicated to skewed data 22. The lung segmentation model performed satisfactorily, with SDC equal to 94.86% and 93.36% for the validation and testing datasets, respectively. We also visually inspected the obtained lung masks to ensure the high quality of the segmentation process.
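The low-resolution criterion above can be sketched as a check on the extent of the lung mask. This example assumes that the lung area is measured as the bounding box of all nonzero mask pixels, which is an interpretation rather than a detail confirmed by the text; the function names and the toy mask are illustrative.

```python
def lung_bounding_box(mask):
    """Return (height, width) of the box enclosing all nonzero pixels.

    `mask` is a list of rows, each a list of 0/1 values.
    An empty or all-zero mask yields (0, 0).
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return (0, 0)
    n_cols = len(mask[0])
    cols = [c for c in range(n_cols) if any(row[c] for row in mask)]
    return (rows[-1] - rows[0] + 1, cols[-1] - cols[0] + 1)


def is_low_resolution(mask, min_size=300):
    """Flag masks whose lung area is under `min_size` pixels in height or width."""
    height, width = lung_bounding_box(mask)
    return height < min_size or width < min_size


# Toy 4x5 mask; real masks would be hundreds of pixels in each dimension.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(lung_bounding_box(toy_mask))
print(is_low_resolution(toy_mask))
```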

Fig. 1. Exemplary images included in the POLCOVID dataset for one representative of each diagnosis group. Original CXR images (a), preprocessed lung area images (b), and lung masks (c) of normal, pneumonia, and COVID-19 cases.

Fig. 2. Demographic summary of the cohort. Proportions of sexes in diagnosis groups and in total are accompanied by numbers of images and numbers of missing records (a). Age distributions in diagnosis groups and in total (b).

Table 1. Numbers of CXR images provided by each medical center with regard to the diagnosis.

Table 2. Numbers of CXR images used for the lung segmentation model training, with regard to the data source and subset.
For lung segmentation model training and testing, we used 6297 CXR images. Out of those, 4003 radiographs were a part of our POLCOVID dataset. The remaining 2294 CXRs came from the publicly available collections: the National Institute of Health - Clinical Center database 12 (1124 CXRs), Shenzhen No. 3 Hospital, Shenzhen, China 13 (662 CXRs), the tuberculosis control program of the Department of Health and Human Services of Montgomery County, USA 13 (138 CXRs), and Guangzhou Women and Children's Medical Center, Guangzhou, China 14 (370 CXRs). Experts manually annotated each CXR image with a lung mask. We randomly divided the CXRs into the training (n=5247), validation (n=500), and test (n=550) subsets. A detailed summary of subsets regarding the image source is presented in Table 2.
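A random division into subsets of the sizes reported above can be sketched as follows. The seed, the identifier format, and the function name are assumptions made for the example; the text does not state how the original split was generated, so this will not reproduce the published assignment.

```python
import random


def split_dataset(items, n_val=500, n_test=550, seed=0):
    """Shuffle items and split them into train, validation, and test lists.

    The validation and test sizes are fixed; everything left over
    goes to the training subset.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, reproducible
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Hypothetical identifiers standing in for the 6297 annotated CXRs.
image_ids = [f"cxr_{i:04d}" for i in range(6297)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```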

Table 4. The summary of age and pack-years in the whole cohort and in diagnosis groups.

Table 5. Definition of variables included in the POLCOVID metadata file.

Table 6. Definition of variables included in the lung segmentation metadata file.