BRAX, Brazilian labeled chest x-ray dataset

Chest radiographs allow for the meticulous examination of a patient’s chest but demand specialized training for proper interpretation. Automated analysis of medical imaging has become increasingly accessible with the advent of machine learning (ML) algorithms. Large labeled datasets are key elements for training and validation of these ML solutions. In this paper we describe the Brazilian labeled chest x-ray dataset, BRAX: an automatically labeled dataset designed to assist researchers in the validation of ML models. The dataset contains 24,959 chest radiography studies from patients presenting to a large general Brazilian hospital. A total of 40,967 images are available in the BRAX dataset. All images have been verified by trained radiologists and de-identified to protect patient privacy. Fourteen labels were derived from free-text radiology reports written in Brazilian Portuguese using Natural Language Processing.

Figure 1 provides an overview of the dataset generation process 21,22.
Data collection. Ethical statement. The project was approved by the Institutional Review Board of Hospital Israelita Albert Einstein (#35503420.8.0000.0071). Requirement for individual patient consent was waived. The study database was anonymized, with all identifiable patient information removed, including the dates of acquisition of the radiographs.

Methods
Data source. All data were obtained from Hospital Israelita Albert Einstein (HIAE). Images were extracted from the institutional PACS (Picture Archiving and Communication System). All chest radiography studies with available reports in the PACS were considered for inclusion. We selected 24,959 high-quality digital chest radiographic studies acquired prior to the COVID-19 pandemic. Radiographs with burned-in sensitive data (i.e., patient name, patient identity, and image display specifications) were excluded, as were images showing rare prostheses that could facilitate patient identification. Figure 2 shows the BRAX dataset flowchart. A subset of 294 images was excluded from BRAX so that it could be used as a hidden test set for further evaluation of machine learning models. Those interested may run their models on this subset, which is not publicly available, upon request to the corresponding author.
Anonymization procedure. DICOM header anonymization was accomplished using an algorithm developed in-house based on a previously described procedure 23,24 and following the rules of the MIRC ClinicalTrialProcessor (CTP) DICOM Anonymizer 25. The application removed DICOM metadata that could be used to identify patients, without compromising the relevant clinical information. We also added an extra conservative step by removing any free-text fields contained in the header. [...] dates), retaining only the original time intervals between study acquisitions, so that chronological information is not lost. Images were reviewed by a board-certified radiologist (E.P.R.) with over 2 years of experience to identify burned-in sensitive data according to the exclusion criteria mentioned above. The images were also double-checked by 5 [...]
Data preparation. Image preparation. All DICOM images were kept with the original uncompressed information, and no transformation was applied in the space or contrast domains. To facilitate access for researchers, we used the open-source SimpleITK 26 Python script 27 to convert the DICOM images to PNG. The output image width was set to 1024 pixels, and grayscale images with high dynamic range were rescaled to [0,255] through intensity windowing (window width and window level were extracted from the DICOM metadata) before conversion to the new format. During rescaling, the DICOM tag "PhotometricInterpretation" is checked to determine whether the pixel values need to be inverted, so that air in the image appears white (highest pixel value), while the outside of the patient's body appears black (lowest pixel value).
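The intensity windowing and optional inversion described above can be sketched in plain Python. This is a simplified linear window, not the SimpleITK script used for BRAX; the function name `window_to_uint8`, its parameters, and the nested-list image representation are assumptions made for illustration, and resizing to a width of 1024 pixels is not covered.

```python
def window_to_uint8(pixels, center, width, invert=False):
    """Map raw intensities to the range 0..255 with a linear window.

    pixels: list of rows, each a list of numbers (illustrative format).
    center, width: window level and window width, assumed to have been
        read from the DICOM metadata beforehand.
    invert: flip the polarity of the output; whether this is needed is
        assumed to be decided by the caller from PhotometricInterpretation.
    """
    if width <= 0:
        raise ValueError("window width must be positive")
    low = center - width / 2.0
    result = []
    for row in pixels:
        new_row = []
        for value in row:
            # Clamp to the window, then scale linearly to 0..255.
            scaled = min(max((value - low) / width, 0.0), 1.0)
            level = int(round(scaled * 255))
            new_row.append(255 - level if invert else level)
        result.append(new_row)
    return result


raw = [[-50, 0, 40, 200, 500]]
print(window_to_uint8(raw, center=100, width=200))
print(window_to_uint8(raw, center=100, width=200, invert=True))
```

Values outside the window saturate at 0 or 255, which is why high-dynamic-range pixel data fit into an 8-bit PNG after this step.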
Radiology reports preparation. All CXR images and reports were reviewed by at least one board-certified member of the radiology staff specialized in cardiothoracic imaging. To reduce inter-observer variability, reports (written in Brazilian Portuguese) are given in a standardized manner, according to the clinical indication. Radiology reports were originally stored in free-text form. Titled sections (i.e., detailed description of all findings and impressions) were based on templates. An example of a typical report is shown in Fig. 3A.
Automated labeling of the radiology reports. We implemented automated extraction of labels from free-text radiology reports based on natural language processing. This process was based on two freely available tools: NegEx 28 and the CheXpert Label Extraction Algorithm 12.
Brazilian Portuguese extension of NegEx for Detection of Negation and Uncertainty.
(1) We translated the NegEx trigger terms (i.e., a list of words that precede negation and uncertainty) from English to Brazilian Portuguese using the Google Sheets built-in function for Google Translate 29 in order to speed up the process of human verification (Fig. 2B).
(2) Three Brazilian radiologists reviewed the translated triggers and added new ones to the NegEx lexicon, based on expressions related to negation and uncertainty specific to the radiology domain.
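To illustrate how trigger terms drive negation and uncertainty detection, the following is a minimal sketch in the spirit of NegEx. The trigger phrases, the token window, and the function name `classify_mention` are hypothetical choices for illustration; they are not the reviewed BRAX lexicon, and the full NegEx algorithm also handles triggers that follow a finding and terms that end a trigger's scope.

```python
import re

# Hypothetical trigger phrases for illustration only; the BRAX lexicon
# reviewed by the radiologists is not reproduced here.
NEGATION_TRIGGERS = ["sem", "ausência de", "não há"]
UNCERTAINTY_TRIGGERS = ["possível", "sugestivo de", "não se pode excluir"]
WINDOW = 5  # assumed maximum number of tokens between trigger and finding


def classify_mention(sentence, finding):
    """Return 'negated', 'uncertain', 'affirmed', or None if not mentioned."""
    text = sentence.lower()
    position = text.find(finding.lower())
    if position < 0:
        return None
    preceding = text[:position]
    # Uncertainty triggers are checked first so that longer uncertain
    # phrases are not mistaken for plain negation.
    for status, triggers in (("uncertain", UNCERTAINTY_TRIGGERS),
                             ("negated", NEGATION_TRIGGERS)):
        for trigger in triggers:
            pattern = r"\b" + re.escape(trigger) + r"\b"
            for match in re.finditer(pattern, preceding):
                tokens_between = preceding[match.end():].split()
                if len(tokens_between) <= WINDOW:
                    return status
    return "affirmed"


print(classify_mention("Sem derrame pleural.", "derrame pleural"))
print(classify_mention("Possível consolidação na base direita.", "consolidação"))
print(classify_mention("Derrame pleural à esquerda.", "derrame pleural"))
```

Only pre-finding triggers are handled here, which keeps the sketch short at the cost of missing constructions where the negation follows the finding.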
BRAX labeler: an expansion of the CheXpert Labeler for Brazilian Portuguese.
(1) The BRAX labeler was built upon the CheXpert Labeler Algorithm 12 to derive labels from both the findings and impression sections of radiological reports written in Brazilian Portuguese (Fig. 3). Fourteen labels (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural effusion, Pneumonia, Pneumothorax, Enlarged cardiomediastinum, Lung lesion, Lung opacity, Pleural other, Fracture, Support Devices, and No Finding; Table 1), representing the most common chest radiographic observations (Fig. 4) and used in previous studies 12,19, were adapted to Brazilian Portuguese 30. We have chosen to use the same labels as CheXpert 12 because they have also been used in other large chest x-ray datasets, such as MIMIC-CXR 19 and ChestX-Ray8 17.
(2) Brazilian Portuguese radiological terms 30 for each label were created based on CheXpert 12 through an iterative process that involved a cardiothoracic radiologist (MCBS) and other general radiologists (EPR, HL), and were then validated by senior cardiothoracic radiologists (GT, GS), according to the frequency and relevance of findings.
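A report can mention the same observation several times, so mention-level results must be combined into one value per label. The sketch below follows the aggregation convention published for the CheXpert labeler (positive outranks uncertain, which outranks negative; 1, -1 and 0 as the respective values). Whether BRAX uses exactly this encoding is an assumption here, and `aggregate_mentions` is a hypothetical helper name.

```python
# Assumed CheXpert-style convention: 1 = positive, -1 = uncertain, 0 = negative.
PRIORITY = {"affirmed": 3, "uncertain": 2, "negated": 1}
VALUE = {"affirmed": 1, "uncertain": -1, "negated": 0}


def aggregate_mentions(mentions):
    """Combine (label, status) pairs from one report into one value per label.

    Labels that are never mentioned are simply absent from the result.
    """
    strongest = {}
    for label, status in mentions:
        current = strongest.get(label)
        if current is None or PRIORITY[status] > PRIORITY[current]:
            strongest[label] = status
    return {label: VALUE[status] for label, status in strongest.items()}


mentions = [
    ("Pleural effusion", "negated"),
    ("Lung opacity", "uncertain"),
    ("Lung opacity", "affirmed"),
    ("Pneumonia", "uncertain"),
]
print(aggregate_mentions(mentions))
```

The special handling that a complete labeler applies to the No Finding label is deliberately left out of this sketch.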

Data Records
The BRAX dataset provides 40,967 images from 24,959 imaging studies of 19,351 patients presenting to the Hospital Israelita Albert Einstein. An overview of the released dataset folder structure is provided in Fig. 5. All data are available on PhysioNet 21,22. Access is controlled and requires the user to register, complete a credentialing process, and sign a data use agreement (see Usage Notes). The BRAX project page on PhysioNet describes the dataset and informs users how they may apply for access.
File organization. Image files are provided in individual folders. PatientID refers to the unique identifier for a single patient. The same patient can have multiple studies. A collection of images associated with a single report is referred to as a study and is identified by the AccessionNumber. Radiograph images in different view positions (usually frontal or lateral views) can be found in different series or in the same series, depending on the modality and how the DICOMs were generated during acquisition.
An example of the Anonymized_DICOMs folder structure for a single patient's images is provided in Fig. 6. The folder name starts with "id" followed by the number for the PatientID DICOM Tag. This example patient has two radiographic studies. The study folder name starts with "Study" followed by the number for the StudyInstanceUID DICOM Tag. Each study has one or more series folders, starting with "Series" followed by the number for the SeriesInstanceUID DICOM Tag. Finally, inside each series folder you may find one or more x-ray DICOM files, with the image file name starting with "image" followed by the number for the SOPInstanceUID DICOM Tag. All identifiers were randomly generated, and their order is not associated with the chronological order of the actual studies.
BRAX contains:
• "Anonymized_DICOMs" folder: all DICOM images organized in sub-folders according to patient identifier, study, series and image (see the section Folder Structure)
• "images" folder: the same structure as the Anonymized_DICOMs folder but containing PNG files instead of DICOM
• "master_spreadsheet.csv": the main dataset table containing the identifiers for each image and associated metadata. The table provides one row per study and one separate column for each label. Columns are detailed below.
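The folder naming convention described in this section can be illustrated with a small parser that recovers the four identifiers from a file path. The separators between each prefix and its number, the example path, and the function name `parse_brax_path` are assumptions for illustration; the released files may use different separators.

```python
import re
from pathlib import PurePosixPath

# (result key, folder or file name prefix) for the last four path components.
PREFIXES = (("patient", "id"), ("study", "Study"),
            ("series", "Series"), ("image", "image"))


def parse_brax_path(path):
    """Split a path into patient, study, series and image identifiers.

    Assumes each component is the prefix, an optional "_" or "-",
    a dotted number, and (for files) an extension such as .dcm or .png.
    """
    parts = PurePosixPath(path).parts[-4:]
    if len(parts) != 4:
        raise ValueError("expected patient/study/series/image components")
    identifiers = {}
    for (key, prefix), part in zip(PREFIXES, parts):
        pattern = re.escape(prefix) + r"[_-]?(\d+(?:\.\d+)*)(?:\.[A-Za-z]+)?"
        match = re.fullmatch(pattern, part)
        if match is None:
            raise ValueError("unexpected path component: " + part)
        identifiers[key] = match.group(1)
    return identifiers


# Hypothetical path, not taken from the dataset.
example = "Anonymized_DICOMs/id_0001/Study_111.222/Series_333.444/image-555.666.dcm"
print(parse_brax_path(example))
```

Because the "images" folder mirrors the DICOM folder, the same function applies to PNG paths.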
Description of columns in the master_spreadsheet.csv.
DicomPath: Path to the DICOM images. As part of the de-identification procedure, the DICOMs were assigned randomly generated ID numbers.
PngPath: Path to the PNG images.
PatientID: Patient's identifier. As part of the de-identification procedure, the Patient IDs were created with randomly generated numbers.
PatientSex: Patient's sex. Enumerated Values: "M" for male; "F" for female; "O" for other.
PatientAge: Age of the patient, provided in 5-year groups. Patients aged 85 or over are classified as "85 or more".
AccessionNumber: A DICOM identifier of the study. As part of the de-identification procedure, the AccessionNumber was randomly generated.

Table 1. Frequency of the radiological findings. The BRAX dataset consists of 14 labeled observations. We report the number of images which contain these observations.

Technical Validation
Automated labeling of the radiology reports. To evaluate the effectiveness of the automated labeling procedure, 1,000 reports were randomly selected and reviewed by two board-certified radiologists (E.P.R. and M.C.B.S.) with over 2 years of experience. When necessary, labels were corrected accordingly. The performance of the combined NegEx and CheXpert automated radiology report labelers is presented in Table 2 as sensitivity (recall), specificity, accuracy, and F1-score compared to ground truth (i.e., agreement between the two radiologists).
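The validation metrics follow the usual confusion-matrix definitions. The sketch below computes them for a single label from two lists of binary values; the function name `binary_metrics` and the toy data are illustrative and unrelated to the values in Table 2.

```python
def binary_metrics(predicted, reference):
    """Sensitivity, specificity, accuracy and F1 for one binary label."""
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy example: labeler output versus radiologist consensus.
labeler = [1, 1, 0, 0, 1, 0, 0, 0]
consensus = [1, 0, 0, 0, 1, 1, 0, 0]
print(binary_metrics(labeler, consensus))
```

For rare findings, accuracy and specificity can look high even when sensitivity is poor, which is one reason to report all four values per label.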

Usage Notes
Free-text reports are not yet provided in the current version. Future releases shall provide a larger volume of data and possibly other metadata for evaluation of social determinants of health. We did not assess potential biases of gender, race or socioeconomic factors in our dataset. Use of the data requires signing a data use agreement that stipulates, among other items, that the user will not share or attempt to re-identify the data. Once approved, data can be directly downloaded from the BRAX Database project on PhysioNet 21,22 at https://doi.org/10.13026/grwk-yh18.

Table 2. Performance of the automated labeling of the radiology reports. Performance of the automated radiology report labeler (pipeline output from NegEx and BRAX labeler) on a subset of 1,000 reports compared to the labeling agreement between two board-certified radiologists on tasks of mention extraction, negation detection and uncertainty detection, as measured by F1-score, Recall and Precision.
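Once the data have been downloaded, the master_spreadsheet.csv described under Data Records can be explored with standard tools. The sketch below groups studies by patient using only the Python standard library. The column names come from the dataset description; the sample rows, the paths in them, and the function name `studies_by_patient` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows; only the column names come from the dataset description.
SAMPLE_CSV = """PatientID,AccessionNumber,PngPath,PatientSex
1001,5001,images/example_a.png,F
1001,5001,images/example_b.png,F
1001,5002,images/example_c.png,F
1002,5003,images/example_d.png,M
"""


def studies_by_patient(handle):
    """Map each PatientID to the sorted list of its distinct AccessionNumbers."""
    grouped = defaultdict(set)
    for row in csv.DictReader(handle):
        grouped[row["PatientID"]].add(row["AccessionNumber"])
    return {patient: sorted(studies) for patient, studies in grouped.items()}


print(studies_by_patient(io.StringIO(SAMPLE_CSV)))
```

With the real file, the same function can be called on a handle opened with `open("master_spreadsheet.csv", newline="")`. Grouping by PatientID in this way is also useful for building training and validation splits that keep all studies from one patient together.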

Code availability
The BRAX Labeler code used for the extraction of labels from Brazilian Portuguese radiology reports is available on GitHub (https://github.com/edreisMD/BRAX-labeler).
To prevent the risk of patient re-identification, the anonymization code is not provided.

Acknowledgements
The creation of the BRAX dataset was funded by the MIT-Brazil TVML Seed Fund award (project "Developing a Publicly Accessible Brazilian Dataset of Chest X-Rays"). Leo Anthony Celi is funded by the National Institutes of Health through the NIBIB R01 grant EB017205. Tom Pollard is partially funded by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number R01EB030362.