Creation and validation of a chest X-ray dataset with eye-tracking and report dictation for AI development

We developed a rich dataset of Chest X-Ray (CXR) images to assist investigators in artificial intelligence. The data were collected using an eye-tracking system while a radiologist reviewed and reported on 1,083 CXR images. The dataset contains the following aligned data: CXR image, transcribed radiology report text, radiologist's dictation audio, and eye gaze coordinate data. We hope this dataset can contribute to various areas of research, particularly explainable and multimodal deep learning/machine learning methods. Furthermore, investigators in disease classification and localization, automated radiology report generation, and human-machine interaction can benefit from these data. We report deep learning experiments that utilize the attention maps produced from the eye gaze data to show the potential utility of this dataset.


Background & Summary
In recent years, artificial intelligence (AI) has been extensively explored for enhancing the efficacy and efficiency of the radiology interpretation and reporting process. As the current prevalent paradigm of AI is deep learning, many of the works in AI for radiology use large datasets of labelled radiology images to train deep neural networks to classify images according to disease classes. Given the high labor cost of annotating images with the areas depicting the disease, large public training datasets often come with global labels describing the whole image 1,2 without localized annotation of the disease areas. The deep neural network model is trusted with discovering the relevant part of the image and learning the features characterizing the disease. This limits the performance of the resulting network. Furthermore, the black-box nature of deep neural networks and the lack of local annotations mean that the process of developing disease classifiers does not take advantage of experts' knowledge of disease appearance and location in medical images. The result is a multi-layer, nonlinear model with serious concerns with respect to the explainability of its output. Moreover, it is a well-studied concern 3 that generalization capability (i.e., performance when the model is deployed to infer class labels for images from other sources or distributions) is affected by scanner differences and/or demographic changes.
In the past five decades eye tracking has been extensively used in radiology for education, perception understanding, and fatigue measurement (example reviews: [4][5][6][7]). More recently, efforts [8][9][10][11] have used eye tracking data to improve segmentation and disease classification in Computed Tomography (CT) radiography by integrating them in deep learning techniques. With such evidence, and with the lack of public datasets that capture eye gaze data in the chest X-Ray (CXR) space, we present a new dataset that can help improve the way machine learning models are developed for radiology applications, and we demonstrate its use in some popular deep learning architectures.
This dataset consists of eye gaze information recorded from a single radiologist interpreting frontal chest radiographs. Dictation data (audio and timestamped text) of the radiology report reading is also provided. We also generated bounding boxes containing anatomical structures on every image and share them as part of this dataset. These bounding boxes can be used in conjunction with eye gaze information to produce more meaningful analyses. We present evidence that this dataset can help with two important tasks for AI practitioners in radiology:
• The coordinates marking the areas of the image that a radiologist looks at while reporting a finding provide an approximate region of interest/attention for that finding. Without altering a radiologist's routine, this approach presents an inexpensive and efficient method for generating a locally annotated collection of images for training machine learning algorithms (e.g., disease classifiers). Since we also share the ground truth bounding boxes, the validity of the eye tracking in marking the location of the finding can be further studied using this dataset. We demonstrate utilization of eye gaze in deep neural network training and show that an improvement in performance can be obtained.
• Tracking of the eyes can characterize how radiologists approach the task of reading radiographs. Study of the eye gaze of radiologists while reading normal and disease radiographs, presented as attention maps, reveals a cognitive workflow pattern that AI developers can use when building their models.
We invite researchers in the radiology community who wish to contribute to the further development of the dataset to contact us.

Methods
Figure 1 provides an overview of the study and data generation process. For this study we used the publicly available MIMIC-CXR Database 2,12 in conjunction with the publicly available Emergency Department (ED) subset of the MIMIC-IV Clinical Database 13. The MIMIC-IV-ED subset contains clinical observations/data and outcomes related to some of the CXR exams in the MIMIC-CXR database. Inclusion and exclusion criteria were applied to the patient attributes and clinical outcomes (via the discharge diagnosis, a.k.a. the ICD-9 code) recorded in the MIMIC-IV Clinical Database 13, resulting in a subset of 1,083 cases that equally cover 3 conditions: Normal, Pneumonia and Congestive Heart Failure (CHF). The corresponding CXR images of these cases were extracted from the MIMIC-CXR database 2. A radiologist (American Board of Radiology certified with over 5 years of experience) performed routine radiology reading of the images using the Gazepoint GP3 Eye Tracker 14 (i.e., eye tracking device), Gazepoint Analysis UX Edition software 15 (i.e., software for performing eye gaze experiments), a headset microphone, a desktop computer and a monitor (Dell S2719DGF) set at 1920x1080 resolution. Radiology reading took place in multiple sessions (i.e., 30 cases per session) over a period of 2 months (i.e., March - May 2020). The Gazepoint Analysis UX Edition 15 exported video files (.avi format) containing eye fixations and voice dictation of the radiologist's reading, along with spreadsheets (.csv format) containing the eye tracker's recorded eye gaze data. The audio was extracted from the video files and saved in wav and mp3 format. Subsequently, these audio files were processed with speech-to-text software (i.e., Google Speech-to-Text) to extract text transcripts along with dictation word time-related information (.json format). Furthermore, these transcripts were manually corrected. The final dataset contains the raw eye gaze signal information (.csv), audio files (.wav, .mp3) and transcript files (.json).

Inclusion and Exclusion Criteria
Figure 2 describes the inclusion/exclusion criteria used to generate this dataset. These criteria were applied on the MIMIC-IV Clinical Database 13 to identify the CXR studies of interest. The studies were used to extract their corresponding CXR images from the MIMIC-CXR Database 2.
We selected two clinically prevalent and high impact diseases, pneumonia and congestive heart failure (CHF), in the Emergency Department (ED) setting. We also picked normal cases as a comparison class. Unlike related CXR labeling efforts 1, where the same labels are derived from radiology reports using natural language processing (NLP) alone, the ground truth for our pneumonia and CHF class labels was derived from unique discharge ICD-9 codes (verified by our clinicians) from the MIMIC-IV-ED tables 13. This ensures the ground truth is based on a formal clinical diagnosis and is likely to be more reliable, given that ICD-9 discharge diagnoses are typically derived from a multi-disciplinary team of treating providers after having considered all clinically relevant information (e.g., bedside observations, labs) in addition to the CXR images. This is particularly important since CXR observations alone may not always be specific enough to reach a pneumonia or CHF clinical diagnosis. The normal class is determined by excluding any ICD-9 codes that may result in abnormalities visible on CXRs and also having no abnormal labels extracted from the relevant CXR reports using a CXR report labeler 16. The code to run the inclusion and exclusion criteria is available on our GitHub repository.
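The ICD-9-based labeling described above can be sketched as a small filtering step over the discharge diagnosis table. The following is a minimal illustration only: the ICD-9 prefix sets and the column names are hypothetical stand-ins, and the real inclusion/exclusion logic (including the normal-reports check) lives in the authors' GitHub repository.

```python
import pandas as pd

# Hypothetical ICD-9 prefixes for the two disease classes (illustrative only).
PNEUMONIA_ICD9 = {"486", "481", "482"}
CHF_ICD9 = {"428"}

def label_study(icd9_codes):
    """Assign a class label from a study's set of discharge ICD-9 codes."""
    prefixes = {c.split(".")[0] for c in icd9_codes}
    if prefixes & PNEUMONIA_ICD9:
        return "Pneumonia"
    if prefixes & CHF_ICD9:
        return "CHF"
    # 'Normal' additionally requires no abnormal report labels; here we
    # naively treat everything else as Normal for illustration.
    return "Normal"

# Toy stand-in for the MIMIC-IV-ED discharge diagnosis table.
diagnoses = pd.DataFrame({
    "study_id": [1, 1, 2, 3],
    "icd9_code": ["486", "401.9", "428.0", "401.9"],
})

labels = (diagnoses.groupby("study_id")["icd9_code"]
          .apply(lambda codes: label_study(set(codes)))
          .rename("label"))
print(labels.to_dict())  # {1: 'Pneumonia', 2: 'CHF', 3: 'Normal'}
```

In the actual pipeline a balanced sample across the three labels would then be drawn, matching the 1,083-case cohort.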
In addition, our sampling criteria prioritized obtaining a good number of examples of disease features across a range of ages and both sexes from the source ED population. The goal is to support building and evaluation of computer vision algorithms that do not overly rely on age and sex biases, which may depict prominent visual features 17, to predict disease classes.

Software and Hardware Setup
The Gazepoint GP3 Eye Tracker 14 is an optical eye tracker that uses infrared light to detect eye fixation. The Gazepoint Analysis UX Edition software 15 is a software suite that comes along with the Gazepoint GP3 Eye Tracker and allows performing eye gaze experiments on image series.
The Gazepoint GP3 Eye Tracker 14 was set up in the radiologist's routine working environment on a Windows desktop PC (connected to a USB 3 port). The Gazepoint Analysis UX Edition software 15 was also installed on the same computer. Each session was a standalone experiment that contained up to 30 images for reading by the radiologist. The radiologist's eyes were 28 inches away from the monitor. The choice of this number of images was intentional, to avoid fatigue and interruptions and to allow for timely offline review and quality assurance of each session's recordings by the rest of the team. The Gazepoint Analysis UX Edition software 15 allows for 9-point calibration, which occurred at the beginning of each session. In addition, the software allows the user to move to the next image either by pressing the spacebar on the keyboard when done with a case or by waiting for a fixed time. In this way the radiologist was able to move to the next CXR image when he was done with a given image, making the experiment easier.

Radiology Reading
The radiologist read 1,083 CXR images, reporting in unstructured prose, just as he would in his routine working environment. The goal was to simulate a typical radiology read with minimal disruption from the eye gaze data collection process. The order of the CXR images was randomized to allow a blinded radiology read. In addition, we intentionally withheld the reason-for-exam information from our radiologist in order to collect an objective CXR exam interpretation based only on the available imaging features.
The original source MIMIC-CXR Database 2 has the original de-identified free text reports for the same images, which were collected in real clinical scenarios where the reading radiologists had access to some patient clinical information outside the CXR image. The radiologists may even have had discussions about the patients with the bedside treating physician. Interpreting CXRs with additional patient clinical information (e.g., age, sex, other signs or symptoms) has the benefit of allowing radiologists to provide a narrower list of disease differential diagnoses by reasoning with their extra medical knowledge. However, it may also have the unintended effect of narrowing the radiology finding descriptions or subconsciously biasing what the radiologists look for in the image. In contrast, our radiologist only had the clinical information that all the CXRs came from an ED clinical setting.
By collecting a more objective read, we ensured that the CXR images used in this dataset have associated reports from both kinds of reading scenarios (read with and without patient clinical information). The goal is to broaden the range of possible technical and clinical research questions that future researchers working with the dataset may ask and explore.

Data Post-Processing
At the end of each session the radiologist exported the following information from the Gazepoint Analysis UX Edition software 15: 1) a fixation spreadsheet (.csv) containing fixation information for each case in the session, 2) an eye gaze spreadsheet (.csv) containing raw eye gaze information for each case in the session, and 3) video files (.avi) containing audio (i.e., the radiologist's dictation) along with his eye gaze fixation heatmaps per session case (see Figure 4). These files were uploaded and shared over IBM's internal BOX TM subscribed service. A team member reviewed each video for any technical quality issues (e.g., corrupted file, video playback stopped abruptly, bad audio quality). Once data collection (i.e., 38 sessions) finished, the following post-processing tasks were performed.

Spreadsheet Merging
From all sessions (i.e., folders), the fixation spreadsheets were concatenated into a single spreadsheet file, fixations.csv, and the raw eye gaze spreadsheets were concatenated into a single spreadsheet file, eye_gaze.csv. Mapping of eye gaze and fixation coordinates from the screen coordinate system to the original MIMIC image coordinate system was also performed at this stage. Detailed descriptions of these tables are provided in the Data Records section.
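The screen-to-image coordinate mapping can be sketched as follows. This is a minimal illustration under one stated assumption: that each CXR image was scaled to fit and centered (letterboxed) on the 1920x1080 display. The released spreadsheets already contain the mapped coordinates, so this function is only meant to make the geometry concrete.

```python
# Map an eye gaze sample from screen fractions (FPOGX, FPOGY in [0, 1]) to
# pixel coordinates on the original CXR image, assuming the image was
# scaled to fit and centered on the display (an assumption for illustration).

SCREEN_W, SCREEN_H = 1920, 1080

def screen_to_image(fx, fy, img_w, img_h):
    scale = min(SCREEN_W / img_w, SCREEN_H / img_h)
    disp_w, disp_h = img_w * scale, img_h * scale
    off_x = (SCREEN_W - disp_w) / 2   # horizontal letterbox offset
    off_y = (SCREEN_H - disp_h) / 2   # vertical letterbox offset
    x = (fx * SCREEN_W - off_x) / scale
    y = (fy * SCREEN_H - off_y) / scale
    return x, y

# A gaze sample at the screen center maps to the image center.
print(screen_to_image(0.5, 0.5, 2160, 2160))  # (1080.0, 1080.0)
```

Samples that map outside [0, img_w) x [0, img_h) fell on the letterbox borders rather than the image itself and can be filtered out.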

Audio Extraction and Transcript Generation
For each session video file (i.e., containing the radiologist's eye gaze fixations and dictation in .avi format, Figure 4) the dictation audio was extracted and saved in audio.wav and audio.mp3 files. We used Google Speech-to-Text to transcribe the audio (i.e., the wav file) into text. Transcribed text was saved in transcript.json, containing timestamps and corresponding words, based on the API example found in the documentation. Furthermore, the transcripts were corrected manually by three (3) team members (all verified by the radiologist) using the original audio. An example of a transcript json is given in the Data Records section.

Segmentation Maps and Bounding Boxes for Anatomies
Two supplemental datasets are also provided to enrich this dataset:
• Segmentation maps: anatomical segmentation maps (.png) are provided for each CXR image; they are shared per case in the audio_segmentation_transcripts folder (see the Audio, Segmentation Maps and Transcripts subsection).
• Bounding boxes: An extension of a bounding box extraction pipeline 18 was used to extract 17 anatomical bounding boxes for each CXR image: 'right lung', 'right upper lung zone', 'right mid lung zone', 'right lower lung zone', 'left lung', 'left upper lung zone', 'left mid lung zone', 'left lower lung zone', 'right hilar structures', 'left hilar structures', 'upper mediastinum', 'cardiac silhouette', 'trachea', 'right costophrenic angle', 'left costophrenic angle', 'right clavicle', 'left clavicle'. These zones cover the clinically most important anatomies on a Posterior Anterior (PA) CXR image. The automatically produced bounding boxes were manually corrected (when required). Each bounding box is described by its top left corner point (x1, y1) and bottom right corner point (x2, y2) in the original CXR image coordinate system. Figure 6 shows an example of anatomical bounding boxes. The bounding boxes of the 1,083 images are contained in bounding_boxes.csv.
Researchers can utilize these two (2) supplemental datasets to improve segmentation and disease localization algorithms by combining them with the eye gaze data. In the Statistical Analysis on Fixations subsection we utilize bounding_boxes.csv to perform statistical analysis between fixations and condition pairs.
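One simple way to combine the eye gaze data with the anatomical bounding boxes is to accumulate fixation time per zone. The sketch below uses the (x1, y1, x2, y2) convention of bounding_boxes.csv; the box coordinates and fixation tuples here are synthetic illustration data, not values from the dataset.

```python
# Minimal sketch: sum fixation durations falling inside each anatomical
# bounding box. Fixation tuples are (x, y, duration) in image pixels.

def in_box(x, y, box):
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def dwell_times(fixations, boxes):
    """Sum fixation durations per anatomical zone (zones may overlap,
    so a fixation can count toward more than one zone)."""
    totals = {name: 0.0 for name in boxes}
    for x, y, dur in fixations:
        for name, box in boxes.items():
            if in_box(x, y, box):
                totals[name] += dur
    return totals

boxes = {"left lung": (1300, 600, 2200, 2400),
         "cardiac silhouette": (900, 1500, 1800, 2600)}
fixations = [(1500, 1000, 0.35), (1200, 2000, 0.50), (100, 100, 0.20)]
print(dwell_times(fixations, boxes))
# {'left lung': 0.35, 'cardiac silhouette': 0.5}
```

Per-zone dwell times of this kind underlie the per-anatomy fixation comparisons in the Statistical Analysis on Fixations subsection.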

Data Records
An overview of the released dataset and the relationships among its parts is provided in Figure 7. Specifically, four (4) data documents and one (1) folder are provided:
1. master_sheet.csv: Spreadsheet containing MIMIC DICOM ids along with the study clinical indication sentence, report-derived finding labels, and ICD-9-derived outcome disease labels.
2. eye_gaze.csv: Spreadsheet containing raw eye gaze data as exported by the Gazepoint Analysis UX Edition software 15.
3. fixations.csv: Spreadsheet containing fixation data as exported by the Gazepoint Analysis UX Edition software 15.
4. bounding_boxes.csv: Spreadsheet containing bounding box coordinates for key frontal CXR anatomical structures.
5. audio_segmentation_transcripts: Folder containing, per case, the dictation audio, anatomical segmentation maps and dictation transcript (see the Audio, Segmentation Maps and Transcripts subsection).
The dataset is hosted at PhysioNet 19. To utilize the dataset, the only requirement for the user is to obtain PhysioNet access to the MIMIC-CXR Database 2 in order to download the original MIMIC CXR images in DICOM format. The dicom_id tag found throughout all the dataset documents maps records to the MIMIC CXR images. A detailed description of each data document is provided in the following subsections.

Master Spreadsheet
The master spreadsheet (master_sheet.csv) provides the following key information:
• The dicom_id column maps each row to the original MIMIC CXR image as well as to the rest of the documents in this dataset.
• The study_id column maps the CXR image/dicom to the associated CXR report, which can be found in the source MIMIC-CXR dataset 2.
• For each CXR study (study_id), granular radiology 'finding' labels have been extracted from the associated original MIMIC reports by two different NLP pipelines: first, the CheXpert NLP pipeline 1, and second, an NLP pipeline developed internally 16.
• Additionally, for each CXR study (study_id), the reason-for-exam indication has been sectioned out from the original MIMIC CXR reports. The indication sentence(s) tend to contain patient clinical information that may not otherwise be visible from the CXR image alone.
Table 1 describes in detail each column found in the master spreadsheet.

Fixations and Eye Gaze Spreadsheets
The eye gaze information is stored in two (2) files: a) fixations.csv, and b) eye_gaze.csv. Both files were exported by the Gazepoint Analysis UX Edition software 15. Specifically, the eye_gaze.csv file contains one row for every data sample collected from the eye tracker, while the fixations.csv file contains a single data entry per fixation. The Gazepoint Analysis UX Edition software 15 generates the fixations.csv file from the eye_gaze.csv file by averaging all data within a fixation to estimate the point of fixation based on the eye gaze samples, stopping when a saccade is detected. Table 2 describes in detail each column found in the fixations and eye gaze spreadsheets.
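The averaging step above can be sketched as grouping raw gaze samples by their fixation ID (FPOGID) and averaging the valid ones. This is an illustrative reconstruction of what the Gazepoint software does, not its actual implementation; the sample tuples below are synthetic (FPOGID, FPOGX, FPOGY, FPOGV).

```python
from collections import defaultdict

# Derive a fixations table from raw gaze samples: average all valid samples
# sharing a fixation ID, mirroring how fixations.csv relates to eye_gaze.csv.

def collapse_fixations(samples):
    grouped = defaultdict(list)
    for fid, fx, fy, valid in samples:
        if valid:                      # drop blink / saccade samples (FPOGV = 0)
            grouped[fid].append((fx, fy))
    return {fid: (sum(p[0] for p in pts) / len(pts),
                  sum(p[1] for p in pts) / len(pts))
            for fid, pts in grouped.items()}

samples = [(1, 0.25, 0.50, 1), (1, 0.75, 0.50, 1), (1, 0.0, 0.0, 0),
           (2, 0.60, 0.55, 1)]
print(collapse_fixations(samples))
# {1: (0.5, 0.5), 2: (0.6, 0.55)}
```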

Bounding Boxes Spreadsheet
The bounding boxes spreadsheet contains the following information: • dicom_id: DICOM ID as provided in MIMIC-CXR Database 2 for each image.
• bbox_name: The names of the 17 rectangular anatomical zones that bound the key anatomical organs on a frontal CXR image. Each lung (right and left) is bounded by its own bounding box, as well as subdivided into common radiological zones (upper, mid and lower lung zones) on each side. The upper mediastinum and the cardiac silhouette (heart) bounding boxes make up the mediastinum anatomy. The trachea bounding box includes the visible tracheal air column on a frontal CXR, as well as the beginnings of the right and left main stem bronchi. The left and right hilar structures contain the left or right main stem bronchus as well as the lymph nodes and blood vessels that enter and leave the lungs in the hilar region. The left and right costophrenic angles are key regions to assess for abnormalities on a frontal CXR. The left and right clavicles can have potential fractures to rule out, but are also important landmarks to assess whether the patient (hence the anatomies on the CXR) is rotated or not (which affects the appearance of potential abnormalities). Some of the bounding boxes (e.g., clavicles) could be missing for an image if the target anatomical structure is cut off from the field of view of the CXR image.
• x1: x coordinate for starting point of bounding box (upper left).
• y1: y coordinate for starting point of bounding box (upper left).
• x2: x coordinate for ending point of bounding box (lower right).
• y2: y coordinate for ending point of bounding box (lower right).
Please see Figure 6 for an example of all the anatomical bounding boxes.

Audio, Segmentation Maps and Transcripts
The audio_segmentation_transcripts folder contains subfolders for all the cases in the study, named by case dicom_id. Each subfolder contains: a) the dictation audio file (mp3, wav), b) the segmentation maps of anatomies (png), as described in the Segmentation Maps and Bounding Boxes for Anatomies subsection above, and c) the dictation transcript (json). The dictation transcript.json contains the following tags:
• full_text: The full text for the transcript.
• time_stamped_text: The full text broken into timestamped phrases, each with:
  • phrase: Phrase text in the transcript.
  • begin_time: The starting time (in seconds) of dictation for a particular phrase.
  • end_time: The end time (in seconds) of dictation of a particular phrase.
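A transcript with the structure above can be parsed with the standard json module. The content below is a synthetic stand-in, not a real dictation; the lookup function illustrates how a phrase can be located by its dictation time.

```python
import json

# Synthetic transcript.json content, following the tag structure described
# above (full_text plus time_stamped_text entries).
raw = json.dumps({
    "full_text": "heart size is normal lungs are clear",
    "time_stamped_text": [
        {"phrase": "heart size is normal", "begin_time": 0.8, "end_time": 2.1},
        {"phrase": "lungs are clear", "begin_time": 2.9, "end_time": 4.0},
    ],
})

def phrase_at(transcript, t):
    """Return the phrase being dictated at time t (seconds), if any."""
    for entry in transcript["time_stamped_text"]:
        if entry["begin_time"] <= t <= entry["end_time"]:
            return entry["phrase"]
    return None

transcript = json.loads(raw)
print(phrase_at(transcript, 1.0))  # heart size is normal
print(phrase_at(transcript, 2.5))  # None
```

Because the dictation timestamps and the eye gaze samples share the session clock, the same lookup can be used to associate fixations with the finding being dictated at that moment.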
Figure 8 shows the structure of the audio_segmentation_transcripts folder, while Figure 18 shows a transcript json example.

Technical Validation
We subjected two aspects of the released data to reliability and quality validation: eye gaze and transcripts.The code for the validation tasks below can be found on our GitHub repository.

Validation of Eye Gaze Data
As mentioned in the Preparation of Images subsection, a calibration image was interjected randomly within the eye gaze sessions to measure the error of the eye gaze on the X and Y axes (Figure 3). A total of 59 calibration images were presented throughout the data collection. We calculated the error using the fixation coordinates of the last entry of each calibration image (i.e., the final resting fixation by the radiologist on the calibration mark). The overall average error on the X and Y axes, as a fraction of screen size, was (error_X, error_Y) = (0.0089, 0.0504), with std (0.0065, 0.0347). In pixels, the same error was (error_X, error_Y) = (17.0356, 54.3943), with std (12.5529, 37.4257).
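The per-image error computation can be sketched as comparing the final resting fixation against the known calibration mark position, in screen-fraction coordinates. The target position and fixation values below are illustrative, not taken from the dataset.

```python
# Sketch of the calibration-error computation: absolute difference between
# the last fixation on a calibration image and the known target position,
# reported both as screen fractions and as pixels on the 1920x1080 display.

SCREEN_W, SCREEN_H = 1920, 1080

def calibration_error(last_fixation, target):
    """Return (fractional error, pixel error) on the X and Y axes."""
    ex = abs(last_fixation[0] - target[0])
    ey = abs(last_fixation[1] - target[1])
    return (ex, ey), (ex * SCREEN_W, ey * SCREEN_H)

# Hypothetical calibration mark at screen center; last fixation nearby.
frac_err, pix_err = calibration_error((0.51, 0.45), (0.50, 0.50))
print(frac_err)   # fractional error per axis
print(pix_err)    # the same error expressed in pixels
```

Averaging these per-image errors over the 59 calibration images yields the summary statistics reported above.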

Validation of Transcripts
As mentioned in the Preparation of Images subsection, transcripts were generated using Google Speech-to-Text on the dictation audio, with timestamps per dictated word. The software produced two (2) types of errors: • Type A: Incorrect identification of a word at a particular time stamp (please see example in Figure 9).
• Type B: Missed transcribed phrases of the dictation (please see example in Figure 10).
The transcripts were manually corrected by three (3) experts and verified by the radiologist. Both types of errors were completely corrected. For Type B errors, the missing text (i.e., more than one (1) word) was added, with the begin_time and end_time estimated manually. To measure the potential error in the transcripts, the number of phrases with multiple words in a single time stamp (i.e., Type B errors) was calculated.
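The Type B count described above can be computed directly from the transcript entries: any phrase with more than one word attached to a single timestamp pair is a candidate. The entries below are synthetic examples for illustration.

```python
# Count Type B candidates: multi-word phrases sharing a single time stamp,
# whose begin/end times were therefore estimated manually.

def count_type_b(time_stamped_text):
    return sum(1 for e in time_stamped_text if len(e["phrase"].split()) > 1)

entries = [
    {"phrase": "cardiomegaly", "begin_time": 1.0, "end_time": 1.6},
    {"phrase": "no pleural effusion", "begin_time": 2.0, "end_time": 3.1},
    {"phrase": "clear", "begin_time": 4.0, "end_time": 4.4},
]
print(count_type_b(entries))  # 1
```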

Statistical Analysis on Fixations
We performed t-test analyses to measure any significant differences between fixations for each pair of conditions within anatomical structures. Figure 11 shows the duration of fixations per image for each disease condition and anatomical area, while Figure 12 shows the p-values from each t-test. Fixations on Normal images are significantly different from those on Pneumonia and CHF images. More fixations are made for images associated with either the Pneumonia or CHF final diagnoses. Moreover, fixations for the abnormal cases are mainly concentrated in anatomical regions (i.e., lungs and heart) that are relevant to the diagnosis, rather than distributed at random. Overall, the fixations on Pneumonia and CHF are comparatively similar, although still statistically different for some regions (e.g., Left Hilar Structures, Left Lung, Cardiac Silhouette, Upper Mediastinum). These statistical differences demonstrate that the radiologist's eye tracking information provides insight into the condition of the patient, and show how a human expert pays attention to the relevant portions of the image when interpreting a CXR exam. The code to replicate the t-test analysis can be found on the GitHub repository.

Usage Notes
The dataset is hosted on PhysioNet 19. The user is also required to apply for access to the MIMIC-CXR Database 2 to download the images used in this study. Our GitHub repository provides a detailed description and source code (Python scripts) on how to use this dataset (e.g., post-processing, machine learning experiments) and reproduce the published validation results. The data in the MIMIC dataset have been previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the database for research.

Use of the Dataset in Machine Learning
To demonstrate the effectiveness and richness of the information provided in this dataset, we performed two sets of machine learning multi-class classification experiments leveraging the eye gaze data. These experiments apply the dataset to simple and popular network architectures and can function as a starting point for researchers.

Both experiments used the eye gaze heatmap data to predict the aforementioned classes (i.e., Normal, CHF, Pneumonia in Table 1) and compare performance with and without the eye gaze information. Our evaluation metric was AUC (Area Under the ROC Curve). The first experiment leverages information from the temporal eye gaze fixation heatmaps and the second uses static eye gaze fixation heatmaps. In contrast to temporal fixation heatmaps, a static fixation heatmap is the aggregation of all the temporal fixations into a single image.

Temporal Heatmaps Experiment
The first model consists of a neural architecture where the image and the temporal fixation heatmap representations are concatenated before the final prediction layer. We denote an instance of this dataset as X(i), which includes the CXR image and the sequence of m temporal fixation heatmaps, where k ∈ {1, ..., m} is the temporal heatmap index. To acquire a fixed CXR representation vector v_CXR, the image is passed through a convolutional layer with 64 filters of kernel size 7 and stride 2, followed by max-pooling, batch normalization and a dense layer of 128 units. The baseline model consists of the aforementioned image representation layer, combined with a final linear output layer that produces the classification prediction. Additionally, for the eye gaze, each heatmap is passed through a similar convolutional encoder and then the sequence of heatmaps is summarized with a 1-layer bidirectional LSTM with self-attention 20,21. We denote the heatmap representation as u(i)_eyegaze. Here, the image and heatmap representations are concatenated before being passed through the final classification layer. Figure 13 shows the full architecture. We train with Adam 22, a 0.001 initial learning rate with a triangular schedule and fixed decay 23, batch size 16 and 0.5 dropout 24. The experimental results in Figure 14 show that incorporating eye gaze temporal information, without any preprocessing, filtering or feature engineering, results in a 4% AUC improvement for this prediction task when compared to the baseline model with just the CXR image data as input.
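The self-attention pooling that produces u_eyegaze from the sequence of heatmap encodings can be illustrated in isolation. The NumPy sketch below shows only that pooling step with placeholder dimensions and a random "learned" query vector; in the actual model the inputs would be BiLSTM outputs and w would be trained.

```python
import numpy as np

# Sketch of self-attention pooling over a sequence of temporal heatmap
# embeddings. H and w are random placeholders standing in for BiLSTM
# outputs and a learned attention vector.

rng = np.random.default_rng(0)
m, d = 5, 128                    # number of temporal heatmaps, embedding size
H = rng.normal(size=(m, d))      # per-heatmap embeddings
w = rng.normal(size=d)           # attention query vector

scores = H @ w                           # one scalar score per heatmap
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                     # softmax over the m time steps
u_eyegaze = alpha @ H                    # attention-weighted summary, shape (d,)

assert np.isclose(alpha.sum(), 1.0) and u_eyegaze.shape == (d,)
# u_eyegaze would then be concatenated with the image representation v_CXR
# before the final classification layer.
```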

Static Heatmaps Experiment
The previous section showed the use of temporal fixation heatmaps, with improvements demonstrated on a simple network architecture over a baseline. In this experiment, we pose the classification problem in the U-Net architecture framework 25 with an additional multi-class classification block at the bottleneck layer (see Figure 15). The encoding and bottleneck arm of the U-Net can be any standard pre-trained classifier without the fully connected layer. The two combined act as a feature encoder for the classifier, and the CNN decoder part of the network runs deconvolution layers to predict the static eye gaze fixation heatmaps. The advantage is that we can jointly train to output the eye gaze static fixation heatmap as well as predict the multi-class classification. Then, during testing on unseen CXR images, the network can predict the disease class and produce a probability heatmap of the most important locations pertaining to the condition.
We used a pretrained EfficientNet-b0 26 as the encoder and bottleneck layers. The classification head was an adaptive average pooling followed by flatten, dropout 24 and linear output layers. The decoder CNN consisted of three convolution-plus-upsampling layers. The loss function was a weighted combination (γ) of the classification and segmentation losses, both of which used a binary cross entropy loss function. The baseline network consisted of just the encoder and bottleneck arm followed by the classification head.
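The weighted joint objective can be written out explicitly. The sketch below is illustrative: the array shapes, targets and γ value are made up (the tuned hyper-parameters are in Table 3), and both terms use binary cross entropy as described above.

```python
import numpy as np

# Joint objective: gamma-weighted combination of the classification BCE
# loss and the heatmap-prediction BCE loss.

def bce(p, t, eps=1e-7):
    """Mean binary cross entropy between predictions p and targets t."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())

def joint_loss(class_probs, class_targets, heatmap_pred, heatmap_true,
               gamma=0.5):
    return gamma * bce(class_probs, class_targets) + \
           (1 - gamma) * bce(heatmap_pred, heatmap_true)

cls_p = np.array([0.8, 0.1, 0.1])        # predicted class probabilities
cls_t = np.array([1.0, 0.0, 0.0])        # one-hot target (e.g., Normal)
hm_p = np.full((8, 8), 0.3)              # predicted static fixation heatmap
hm_t = np.zeros((8, 8))                  # target heatmap (empty here)
print(round(joint_loss(cls_p, cls_t, hm_p, hm_t), 4))  # 0.2506
```

Setting γ = 1 recovers the classification-only baseline objective, which makes the baseline a special case of the joint model.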
The hyper-parameter tuning for both the U-Net and the baseline classifier was performed using the Tune library 27, and the resulting best performing configuration is shown in Table 3. Figure 16 shows the U-Net and baseline AUCs. Both had similar performance. However, for this experiment, we are interested in seeing how network interpretability improved with the use of static eye gaze heatmaps. Figure 17 shows a qualitative comparison of the Grad-CAM 28 activation maps. The Grad-CAM approach is one of the common methods to visualize activation maps of convolutional networks. While the Grad-CAM based heatmaps do not clearly highlight the disease locations, we see clearly that the heatmap probability outputs of the U-Net highlight regions similar to what the static eye gaze heatmap shows.
With both experiments we aimed to demonstrate different use cases of the eye gaze data in machine learning. With the first experiment we wanted to show how eye gaze data can be utilized in a human-machine setting where the radiologist's eye gaze information is fed into the algorithm. The second experiment shows how eye gaze information can be used for explainability purposes through generating verified activation maps. We intentionally did not include other modalities (audio, text) because of the complexity of such experiments and the scope of this paper (i.e., dataset description). We hope that these experiments can serve as a starting point for researchers to explore novel ways to utilize this multi-modal dataset.

Limitations of study
Although this study provides a unique large research dataset, we acknowledge the following limitations:
1. The study was performed with a single radiologist. This can certainly bias the dataset (it lacks inter-observer variability) and we aim to expand the data collection with multiple radiologists in the future. However, given the relatively large size and richness of data from various sources (i.e., multi-modal), we believe that the current dataset already holds great value for the research community. In addition, we have shown with preliminary machine learning experiments that a model trained to optimize on a radiologist's eye tracking pattern has improved diagnostic performance compared to a baseline model trained with weak image-level labels.
2. The images used during the radiology reading were in 'png' format and not in DICOM, because the Gazepoint Analysis UX Edition 15 does not support the DICOM format. This had the shortcoming that the radiologist could not utilize windowing techniques. However, the png images were prepared using the windowing information in the original DICOM images.
3. This dataset includes only Posterior Anterior (PA) CXR images, as selected by the inclusion/exclusion criteria (Figure 2). This view position criterion was clinically chosen because of its higher quality images compared to Anterior Posterior (AP) CXR images. Therefore, any analysis (e.g., machine learning models trained on only this dataset) may not generalize to AP CXR images.
The study overview (Figure 1) comprises the following steps:
• Inclusion and exclusion criteria on the MIMIC dataset (see details in the Inclusion and Exclusion Criteria section)
• Case sampling and image preparation for the eye gaze experiment (see details in the Preparation of Images section)
• Data post-processing:
  • Speech-to-text on dictation audio (see details in the Audio Extraction and Transcript Generation section)
  • Mapping of eye gaze coordinates to original image coordinates (see details in the Fixations and Eye Gaze Spreadsheets section)
  • Generation of heatmap images (i.e., temporal or static) and videos from the eye gaze coordinates. The temporal and static heatmap images were used in our demonstrations of machine learning methods in the Use of the Dataset in Machine Learning section.
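A static fixation heatmap of the kind used in the experiments can be generated by depositing a duration-weighted Gaussian at each fixation location. The sketch below is illustrative: the image size, sigma and fixation list are made-up values, and the released code may use a different kernel or normalization.

```python
import numpy as np

# Build a static fixation heatmap: one duration-weighted Gaussian per
# fixation, normalized to [0, 1].

def static_heatmap(fixations, shape, sigma=20.0):
    """fixations: iterable of (x, y, duration) in image pixel coordinates."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    heat = np.zeros(shape)
    for x, y, dur in fixations:
        heat += dur * np.exp(-((xx - x) ** 2 + (yy - y) ** 2)
                             / (2 * sigma ** 2))
    return heat / heat.max()

hm = static_heatmap([(60, 40, 0.5), (120, 90, 1.0)], shape=(128, 160))
# The peak sits at the longest fixation:
print(np.unravel_index(hm.argmax(), hm.shape))  # (90, 120)
```

Temporal heatmaps follow the same recipe, except that fixations are binned by time so that each time step yields its own heatmap in the sequence.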

normal-reports
No affirmed abnormal finding labels or descriptors documented in the original MIMIC-CXR reports, extracted using an internal CXR labeling pipeline 16 .

Normal
No abnormal chest related final diagnosis from the Emergency Department (ED) discharge ICD-9 records AND have normal-reports as defined above.

CHF
A clinical diagnosis of heart failure (includes ICD-9 for congestive heart failure, chronic or acute on chronic heart failure) from the ED visit as determined from the associated ICD-9 discharge diagnostic code.
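A minimal sketch of this kind of diagnosis-based selection, assuming the ED discharge diagnoses have been loaded into a mapping from stay ID to ICD-9 code strings; the `428` prefix (ICD-9 heart failure) is used here purely as an illustration of the filtering logic, not as the study's exact code list:

```python
def select_chf(records, chf_prefixes=("428",)):
    """Return stay IDs whose discharge ICD-9 codes match any heart
    failure prefix. `records` maps stay_id -> list of ICD-9 strings."""
    return sorted(sid for sid, codes in records.items()
                  if any(c.startswith(p) for p in chf_prefixes for c in codes))
```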

CNT
The counter data variable is incremented by 1 for each data record sent by the server. Useful to determine if any data packets are missed by the client.

TIME(in secs)
The time elapsed in seconds since the last system initialization or calibration. The time stamp is recorded at the end of the transmission of the image from camera to computer. Useful for synchronization and to determine if the server computer is processing the images at the full frame rate. For a 60 Hz camera, the TIME value should increment by 1/60 seconds.
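For example, dropped records can be detected from gaps in CNT, and the effective frame rate can be checked from successive TIME stamps (which should be 1/60 s apart for a 60 Hz camera). This is a sketch over plain Python lists of the two columns; the tolerance fraction is an arbitrary choice:

```python
def check_stream(cnt, time_s, expected_hz=60.0, tol=0.25):
    """Count missed packets (gaps in CNT) and flag the indices of
    TIME intervals deviating from the expected camera period by
    more than `tol` (as a fraction of that period)."""
    missed = sum(b - a - 1 for a, b in zip(cnt, cnt[1:]) if b - a > 1)
    period = 1.0 / expected_hz
    bad = [i for i, (t0, t1) in enumerate(zip(time_s, time_s[1:]))
           if abs((t1 - t0) - period) > tol * period]
    return missed, bad
```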

TIMETICK(f=10000000)
This is a signed 64-bit integer which indicates the number of CPU time ticks for high precision synchronization with other data collected on the same CPU.

FPOGX
The X-coordinates of the fixation POG, as a fraction of the screen size. (0,0) is the top left, (0.5,0.5) is the screen center, and (1.0,1.0) is the bottom right.

FPOGY
The Y-coordinates of the fixation POG, as a fraction of the screen size. (0,0) is the top left, (0.5,0.5) is the screen center, and (1.0,1.0) is the bottom right.

FPOGS
The starting time of the fixation POG in seconds since the system initialization or calibration.

FPOGD
The duration of the fixation POG in seconds.

FPOGID
The fixation POG ID number.

FPOGV
The valid flag, with value 1 (TRUE) if the fixation POG data is valid and 0 (FALSE) if it is not. FPOGV is TRUE only when either one, or both, of the eyes are detected AND a fixation is detected. FPOGV is FALSE at all other times, for example when the subject blinks, when there is no face in the field of view, or when the eyes move to the next fixation (i.e., a saccade).

BPOGX
The X-coordinates of the best eye POG, as a fraction of the screen size.

BPOGY
The Y-coordinates of the best eye POG, as a fraction of the screen size.

BPOGV
The valid flag with value of 1 if the data is valid, and 0 if it is not.

LPCX
The X-coordinates of the left eye pupil in the camera image, as a fraction of the camera image size.

LPCY
The Y-coordinates of the left eye pupil in the camera image, as a fraction of the camera image size.

LPD
The diameter of the left eye pupil in pixels.

LPS
The scale factor of the left eye pupil (unitless). Value equals 1 at calibration depth, is less than 1 when the user is closer to the eye tracker, and greater than 1 when the user is further away.

LPV
The valid flag with value of 1 if the data is valid, and 0 if it is not.

RPCX
The X-coordinates of the right eye pupil in the camera image, as a fraction of the camera image size.

RPCY
The Y-coordinates of the right eye pupil in the camera image, as a fraction of the camera image size.

RPD
The diameter of the right eye pupil in pixels.

RPS
The scale factor of the right eye pupil (unitless). Value equals 1 at calibration depth, is less than 1 when the user is closer to the eye tracker, and greater than 1 when the user is further away.

RPV
The valid flag with value of 1 if the data is valid, and 0 if it is not.
BKID
Each blink is assigned an ID value, incremented by one. The BKID value equals 0 for every record where no blink has been detected.

BKDUR
The duration of the preceding blink in seconds.

BKPMIN
The number of blinks in the previous 60 second period of time.

LPMM
The diameter of the left eye pupil in millimeters.

LPMMV
The valid flag with value of 1 if the data is valid, and 0 if it is not.

RPMM
The diameter of the right eye pupil in millimeters.

RPMMV
The valid flag with value of 1 if the data is valid, and 0 if it is not.

SACCADE-MAG
Magnitude of the saccade calculated as distance between each fixation (in pixels).

SACCADE-DIR
The direction or angle between each fixation (in degrees from horizontal).

X_ORIGINAL
The X coordinate of the fixation in the original DICOM image.

Y_ORIGINAL
The Y coordinate of the fixation in the original DICOM image.
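Putting several of these fields together, a minimal sketch (assuming the spreadsheet has been loaded into per-row dictionaries and per-column lists) keeps one record per fixation via FPOGV and FPOGID, and recomputes SACCADE-MAG and SACCADE-DIR from consecutive fixation centers in original-image pixel coordinates. Note that image Y coordinates grow downward, which affects the sign convention of the direction angle.

```python
import math

def valid_fixations(rows):
    """Keep one record per fixation: the last row with FPOGV == 1 for
    each FPOGID (later rows carry the accumulated duration FPOGD)."""
    by_id = {}
    for row in rows:
        if row["FPOGV"] == 1:
            by_id[row["FPOGID"]] = row  # later records overwrite earlier ones
    return [by_id[k] for k in sorted(by_id)]

def saccades(x_orig, y_orig):
    """Magnitude (pixels) and direction (degrees from horizontal)
    between consecutive fixation centers."""
    pts = list(zip(x_orig, y_orig))
    return [(math.hypot(x1 - x0, y1 - y0),
             math.degrees(math.atan2(y1 - y0, x1 - x0)))
            for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
```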

• Total number of phrases: 19,499
• Number of phrases with a single word: 18,434
• Number of phrases with multiple words: 1,065

Figure 2. Sampling flowchart for selecting images for this study from the MIMIC-IV (the ED subset) and the MIMIC-CXR datasets 2,13 .

Figure 4. Sample video exported from Gazepoint Analysis UX Edition 15 , showing a CXR case image with overlaid fixations.

Figure 5. From left to right: CXR image, left lung, right lung, aortic knob, and mediastinum.

Figure 6. Sample CXR case with 17 overlaying anatomical bounding boxes. The anatomies in the chest overlay one another on CXRs, since the image is the 2D X-ray shadow capture of a 3D object.

Figure 17. Qualitative comparison of the interpretability of U-Net-based probability maps with GradCAM. From left to right: CXR image, GradCAM from the baseline model, GradCAM from the U-Net encoder, static eye gaze heatmap, and U-Net probability map.

• Validation of eye gaze fixation quality using calibration images (see details in Validation of Eye Gaze Data section)
• Validation of quality in transcribed dictations (see details in Validation of Transcripts section)
• t-tests on eye gaze fixations for each anatomical structure and condition pair (see details in Statistical Analysis on Fixations section)
4. Machine Learning Experiments, as described in the Use of the Dataset in Machine Learning section. Software requirements are listed in the GitHub repository.

Table 1. Online-only Master Spreadsheet.

Table 2. Online-only Fixations and Eye Gaze Spreadsheets.

Table 3. Best-performing hyper-parameters used for the static heatmap experiments, found using the Tune 27 library.