Pixel-level annotated dataset of computed tomography angiography images of acute pulmonary embolism

Pulmonary embolism has a high incidence and mortality, especially if undiagnosed. The examination of choice for diagnosing the disease is computed tomography pulmonary angiography. As many factors can lead to misinterpretations and diagnostic errors, different groups are utilizing deep learning methods to help improve this process. The diagnostic accuracy of these methods tends to increase by augmenting the training dataset. Deep learning methods can potentially benefit from the use of images acquired with devices from different vendors. To the best of our knowledge, we have developed the first public dataset annotated at the pixel and image levels and the first pixel-level annotated dataset to contain examinations performed with equipment from Toshiba and GE. This dataset includes 40 examinations, half performed with each piece of equipment, representing samples from two medical services. We also included measurements related to the cardiac and circulatory consequences of pulmonary embolism. We encourage the use of this dataset to develop, evaluate and compare the performance of new AI algorithms designed to diagnose PE.


Background & Summary
Pulmonary embolism (PE) has a high incidence and mortality.It occurs when a blood clot, most commonly from the deep venous system, moves into the pulmonary arterial circulation 1 .Up to 300,000 deaths per year are estimated to occur in the United States due to PE 2 .Less than 10% of deaths occur among diagnosed and treated patients, indicating a potential reduction in mortality by improving the diagnostic accuracy for the disease 3 .
Computed tomography pulmonary angiography (CTPA) is the examination of choice for evaluating patients with PE 4,5 .After intravenous infusion of iodinated contrast medium, CT is performed when there is optimal opacification of the pulmonary arterial circulation, and the thrombus is identified as an intraluminal filling defect.
CTPA image interpretation is a complex task: radiologists must carefully search for contrast filling defects in the entire pulmonary arterial vasculature across a large number of images.Technical problems, patient-related factors, anatomical issues and the presence of other pathologies 6 can lead to misdiagnosis.
Computer-aided diagnosis (CAD) programs aimed at reducing these errors can reduce mortality.Several approaches have already been proposed  ; however, a definitive solution has not yet been reached. Morerecently, there has been increased interest in the creation of artificial intelligence (AI) techniques, especially using artificial neural networks (ANNs), for addressing this problem.
The diagnostic performance of these techniques is highly dependent on the dataset used for training them, as it must contain examinations as diverse as those in real applications.The diagnostic accuracy of these methods tends to increase by augmenting the training dataset 49 .
Obtaining reliable datasets is a considerable obstacle encountered by researchers, as it is time-consuming, requires radiologists with experience to recognize PE and depends on medical center cooperation.Furthermore, to be suitable for supervised learning applications, the dataset must be annotated.Two different annotation approaches have been used in the three public datasets available: a pixel-level annotation, in which all pixels of the thrombus are demarcated as a ground truth, and annotations at the image and study levels 50 , in which images with a visible PE receive a label, but the thrombus itself is not demarcated.The first approach is more versatile and can be converted into an annotation at the image level, but the opposite is not true.
The pixel-level annotation may help the algorithm to predict the exact region in which the embolus is located and to verify if this was the region used by the algorithm to generate the diagnostic output.It is especially useful for evaluating CADs designed to diagnose PE, as a high false-positive rate is a notable challenge for many of these algorithms.
To date, there are only two public datasets containing CTPA examinations with pulmonary emboli annotated at the pixel level.The first one contains 91 examinations of patients with PE obtained using SIEMENS CT scanners 51 , and the second contains 35 examinations, 33 of which were conducted in patients with PE, obtained with CT scanners from PHILIPS and Neusoft Medical System Co 52 .The information published from these datasets does not disclose the examination selection process or the inclusion and exclusion criteria.
A recently published guide for research on AI 53 highlights the importance of using datasets containing images from devices from multiple vendors due to the variability inherent in such images.
There is a shortage of public datasets with a representative sample of examinations annotated at the pixel level that the AI algorithm can process in a real clinical setting.In medical practice, there is great variability among CTPA images, which occurs due to patient-related factors, such as different biotypes and comorbidities that can obscure PE, and technical factors, such as delays in image acquisition following infusion of the contrast media or with inadequate infusion flow, that lead to suboptimal image quality.
We developed a dataset with a sample of cases of acute PE annotated at the pixel and image levels, making it suitable for algorithms developed using both approaches (Fig. 1).Our dataset contains 40 examinations performed with multidetector scanners, half from a Toshiba CT and the other half from a GE CT 54 .
The dataset was primarily used in conjunction with the two public datasets of CTPA examinations with pulmonary emboli annotated at the pixel level 51,52 to develop a program for the diagnosis of PE 48 .A method capable of finding and segmenting PEs in CT images using deep learning was subsequently developed.
We encourage the use of this dataset to develop, evaluate and compare the performance of AI algorithms designed to diagnose PE.

Methods
The study was approved by the Research Ethics Committee of Hospital de Clínicas-Federal University of Paraná.Given the retrospective nature of the study, only retrospective access to anonymized scan files was necessary, and so the need for informed consent was waived by the ethics committee.Aiming to preserve the identity of the participants, the examinations were deidentified by deleting patients' personal information, such as names, date of birth and identification numbers from the CT scans.Fields corresponding to time or numbers were replaced by "000000.00".Fields corresponding to dates were replaced by "00010101".Written Fields were replaced by "Anonymized" or removed.
Imaging.Twenty CTPA scans were performed in a public university hospital with a 64-channel Toshiba Aquilion CT scanner, with a tube voltage of 120 KVp, slice thickness of 1.0 mm, gantry rotation time of 0.5 seconds, beam pitch of 1.485 and dose modulation protocol.
The other twenty scans were performed in a private imaging practice with a GE Revolution 512 CT scanner, with a tube voltage of 120 KVp, slice thickness of 0.625 mm, slice interval slice thickness of 0.625 mm, gantry rotation time of 0.5 seconds, beam pitch of 0.9 and dose modulation protocol.

Image segmentation.
All CTPA scans were used to make a diagnosis of acute PE by a staff radiologist.The diagnosis and location of the acute pulmonary embolism were confirmed by a thoracic radiologist with 32 years of experience.After this, a third-year resident generated the pixel-level ground-truth masks, which were revised by a certified radiologist.
The examinations were segmented using manual mode in ITK-SNAP 55 , generating the ground-truth mask in which the thrombus pixels are demarcated (Fig. 2).Based on this mask, image-level segmentation was performed, labeling slices containing the thrombi.

CtPa features related to right heart strain and pulmonary artery hypertension. Obstruction
of the pulmonary vasculature due to PE can increase vascular resistance, leading to an increase in pulmonary arterial pressure and right cardiac strain.Indirect signs such as pulmonary artery dilatation, right ventricular enlargement (increase in the right ventricle-to-left ventricle diameter ratio), inferior vena cava (IVC) contrast reflux, and abnormal positioning of the interventricular septum (flattening or even paradoxically bowing toward the left ventricle), can be observed on CTPA scans (Fig. 3).
In all CTPA scans, we evaluated the largest artery involved, inferior vena cava reflux, interventricular septum flattening or paradoxical bowing, pulmonary artery trunk diameter (PAD), transverse diameters of the right ventricle (RV) and left ventricle (LV)-measured between the endocardial surfaces in the largest place perpendicular to the longitudinal axis-and right ventricle-to-left ventricle diameter ratio (Tables 1, 2).

Data records
All data records described in this paper are available on a Figshare repository 54 .This repository contains three folders.The first contains CTPA images in Digital Imaging and Communications in Medicine (DICOM) format.The second contains the ground-truth pixel-level segmentation of the location of the pulmonary embolus in Neuroimaging Informatics Technology Initiative (NiFTI) format.The segmentations consist of a three-dimensional matrix in which each element corresponds to a voxel of the CT scan: the elements corresponding to the embolus have a value of "1", and the others have a value of "0".The third is in comma separate values (CSV) format, in which each element corresponds to a slice of the CT scan.The first element represents   the first slice, and the last element represents the lowest slice.Slices in which the embolus can be visualized are represented by the number "1", and the others by the number "0".The files corresponding to patients scanned with the GE equipment are named 01GE to 20GE, and the files corresponding to patients scanned with the Toshiba equipment are named 01TS to 20TS, according to the number of each patient in Tables 1, 2, which are also available on the repository in Excel binary file (XLS) format.

technical Validation
Exam selection.The CTPA scans were selected retrospectively by a search of the digital files from a public university hospital and a private imaging practice.
An arbitrary starting date was defined for each medical facility.From the defined starting date, all CTPA scans were sequentially reviewed until we reached 20 examinations with PE from each device that fit the inclusion and exclusion criteria.
For the Toshiba device, the examinations were performed from November 5, 2018 to February 5, 2019.For the GE device, the examinations were performed from November 9, 2018 to September 20, 2019.

Inclusion criteria.
Chest CT scans performed using the PE protocol (CTPA) for diagnosing acute PE.File availability in the Picture Archive and Communication System (PACS) from each medical center.

Exclusion criteria.
Artifacts that prevented the radiologist from visually interpreting the images.

Examinations with fully or partially corrupted files. Follow-up examinations (only the first CTPA was used).
There was no restriction on age, patient status (inpatients or outpatients) or any other inclusion or exclusion criteria different from those mentioned.

Fig. 1
Fig. 1 Steps for producing the dataset.First step: definition of inclusion and exclusion criteria.Second and third steps: selection of 20 examinations performed with each machine on patients with acute PE.Fourth step: image annotation at the pixel and image levels.Fifth step: evaluation of features related to right heart strain and pulmonary artery hypertension.

Fig. 2
Fig. 2 Pixel level annotation.Column (a) shows CTPA images of pulmonary emboli in different anatomical locations.In column (b), pixel-level annotations show all pixels of the embolus of the corresponding CTPA image in column (a) in white.In column (c), the corresponding images of columns (a) and (b) are superimposed, and the thrombus is shown in red.

Table 2 .
Data from exams obtained with the GE Revolution 512.F: female.M: male.PAD: pulmonary artery diameter.RV: right ventricle diameter.LV: left ventricle diameter.RV/LV: right ventricle-to-left ventricle diameter ratio.IVC Reflux: inferior vena cava reflux of contrast media.IV Septum: interventricular septum position.