MDF-Net for abnormality detection by fusing X-rays with clinical data

This study investigates the effects of including patients’ clinical information on the performance of deep learning (DL) classifiers for disease location in chest X-ray images. Although current classifiers achieve high performance using chest X-ray images alone, consultations with practicing radiologists indicate that clinical data is highly informative and essential for interpreting medical images and making proper diagnoses. In this work, we propose a novel architecture consisting of two fusion methods that enable the model to simultaneously process patients’ clinical data (structured data) and chest X-rays (image data). Since these data modalities are in different dimensional spaces, we propose a spatial arrangement strategy, spatialization, to facilitate the multimodal learning process in a Mask R-CNN model. We performed an extensive experimental evaluation using MIMIC-Eye, a dataset comprising different modalities: MIMIC-CXR (chest X-ray images), MIMIC IV-ED (patients’ clinical data), and REFLACX (annotations of disease locations in chest X-rays). Results show that incorporating patients’ clinical data in a DL model together with the proposed fusion methods improves disease localization in chest X-rays by 12% in terms of Average Precision compared to a standard Mask R-CNN using chest X-rays alone. Further ablation studies also emphasize the importance of multimodal DL architectures and the incorporation of patients’ clinical data in disease localization. In the interest of fostering scientific reproducibility, the architecture proposed within this investigation has been made publicly accessible (https://github.com/ChihchengHsieh/multimodal-abnormalities-detection).


Introduction
According to The Lancet ?, 2019 witnessed a shortage of 6.4 million physicians, 30.6 million nurses, and 2.9 million pharmaceutical personnel across 132 countries worldwide, especially in Low and Medium Income Countries. The situation has worsened since the pandemic, as medical staff were disproportionately affected.
Deep Learning (DL) technologies promise to deliver benefits for health systems, professionals, and the public, making existing clinical and administrative processes more effective, efficient, and equitable. These technologies have become highly popular in the medical imaging field, where a plethora of applications have been successfully addressed, e.g., breast imaging ?, ?, left ventricular assessment ?, ?, dermoscopy analysis ?, ?, and chest X-rays ?, ?, ?, ?, which have gained further attention due to the recent pandemic. Despite their advantages, these systems have notable shortcomings: they require extensive amounts of labelled data to operate correctly. Explicitly labelling anomalies in large amounts of medical images requires the availability of medical experts, who are scarce and expensive. The process is time-consuming and costly, resulting in bottlenecks in research advancements ?. Additionally, the complex interconnected architectures make DL predictions opaque and resistant to scrutiny, hindering the adoption of AI systems in public health (known as the "black box" problem ?, ?, ?, ?, ?, ?, ?, ?). A system that could automatically annotate or highlight relevant regions in medical images in a similar way to humans would be extremely useful and could save millions of dollars, creating breakthroughs in research and in AI adoption in healthcare ?.
Several works in the literature attempt to automatically learn regions of interest indicating the patient's clinical abnormalities. These works mainly use DL approaches, which have been found to be efficient and effective on a variety of computer vision tasks ?, ?, ?, ?. Mask R-CNN ? is one of the most widely used DL architectures to predict regions with abnormalities in images. However, the predictive performance of such architectures is still low, and they do not take into consideration how expert radiologists assess and diagnose these images. The human component is completely disregarded in most DL studies that predict abnormalities in chest X-rays. This is relevant because when radiologists look at an X-ray image they experience it in a multimodal world: they see objects, textures, shapes, etc. It is the combination of these modalities that makes humans capable of building mental models that generalize well with less data when compared to DL approaches.
Advancements in computer-aided diagnostics (CAD) employing radiomics and deep learning have also been explored in the literature. A compelling example can be seen in a study that presents an attention-augmented Wasserstein generative adversarial network (AA-WGAN) for fundus retinal vessel segmentation. The application of attention-augmented convolution and squeeze-excitation modules highlights regions of interest and suppresses extraneous information in the images, proving effective in segmenting intricate vascular structures ?. Further, an attention-based glioma grading network (AGGN) for MRI data shows superior performance, highlighting the key modalities and locations in the feature maps even without manually labelled tumour masks ?. Lastly, a CAD model, Cov-Net, exhibits robust feature learning for accurate COVID-19 detection from chest X-ray images, outperforming other computer vision algorithms ?. Together, these studies underscore the potential of radiomics and deep learning in improved health anomaly detection, segmentation, and grading.
Multimodal DL consists of architectures that can learn, process, and link information from different data modalities (such as text, images, structured data, etc.) ?, ?, ?, ?, ?, ?. Deep learning can benefit from multimodal data in terms of generalization and performance compared to the unimodal paradigm (see, for instance, Azam et al. ? for a comprehensive review of medical multimodal images). In terms of multimodal DL approaches for chest X-ray images, most works in the literature focus on combining image data with text to generate reports, predict diseases, or even detect lesions by training BERT-like models ?, ?, ?, ?. However, in terms of disease classification, the medical reports associated with the images that are used for training already contain some information about the patient's diseases, which may generate biased results.
A recent literature review ? indicated that clinical data is highly informative and essential for radiologists to interpret and make proper diagnoses. To the best of our knowledge, there are no multimodal DL approaches that combine patients' clinical information with X-ray images to predict the location of abnormalities in chest X-rays. This justifies and motivates our research path in the present paper. Concretely, combining chest X-ray images and clinical data addresses a critical gap in current object detection deep learning approaches, where methods that fuse tabular data and image data are very scarce. Our interviews with radiologists reinforced the importance of clinical data in making accurate diagnoses from chest X-ray images, because radiologists cannot make an accurate assessment of an X-ray image without knowing the patient's clinical data. By integrating these two types of data into our model, we aim to capture the nuanced decision-making process of radiologists more effectively. This approach allows the model to leverage not just the patterns visible in the images, but also the rich contextual information available in the clinical data, leading to a more holistic and accurate disease localization.
In this paper, we propose the Multimodal Dual-Fusion Network (MDF-Net), which is a novel architecture inspired by and extended from Mask R-CNN ?. MDF-Net can fuse chest X-ray images and clinical features simultaneously to detect regions in chest X-rays with abnormalities more accurately. The proposed architecture uses a two-stage detector comprising a Region Proposal Network (RPN) of Mask R-CNN followed by an attention mechanism to extract information only from Regions of Interest (RoIs). Figure 1 presents a general description of the proposed framework. Figure 1 panel a) shows the overall architecture of the proposed model. The prediction process can be divided into three phases. The first phase aims to extract the semantics (feature maps) from input data. The two modalities are processed separately. One branch computes a feature map from the input images, and a second branch computes a feature map from clinical data. In the second phase, we conduct a fusion operation to fuse the two feature maps above to obtain a joint representation. Then, the third phase applies the final classifier to predict the bounding boxes of abnormality. In Figure 1 panel b), we integrated the triage data from MIMIC-IV ED with REFLACX in order to get corresponding clinical data for each CXR image. The integrated (multimodal) dataset allows us to perform multimodal learning with the model shown in Figure 1 panel a). In the end, we performed two ablation studies: one that investigated the impact of the different fusion methods in our architecture, and another that investigated the impact of different sets of clinical features for abnormality detection. The performance of the models was measured using Average Precision (AP) and Intersection Over predicted Bounding Boxes (IoBB) (Figure 1 panel c).
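To make the IoBB metric named above concrete, the following minimal sketch assumes the usual reading of the name: the intersection area divided by the area of the predicted box. The function name and the `(x1, y1, x2, y2)` box format are illustrative assumptions, not taken from the released code.

```python
def iobb(pred, truth):
    """Intersection over the predicted bounding box.

    Boxes are (x1, y1, x2, y2) tuples; this format is an assumption for illustration.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    # Normalise by the predicted box only, unlike IoU, which uses the union.
    return inter / pred_area if pred_area > 0 else 0.0
```

Under this definition, a small predicted box that lies entirely inside a larger ground-truth region scores 1.0, whereas IoU would penalise the size mismatch.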
The key contributions of this work are as follows: (1) We propose a strategy to extract corresponding clinical data for CXR images from the MIMIC ?, ?, ? dataset. This strategy is then used to construct our own multimodal dataset for the abnormality detection task; (2) We propose a multimodal learning architecture and two fusion methods to fuse tabular and image data together; (3) We demonstrate the effectiveness and importance of clinical data in abnormality detection through ablation studies.

Results
Overall, our experiments show that clinical data plays an important role in abnormality detection. Using clinical data together with chest X-ray images, the proposed MDF-Net architecture achieved a significant improvement when compared to the baseline Mask R-CNN model using X-rays only. Figure 2 panel a) shows an example where the proposed MDF-Net was able to correctly predict a bilateral pulmonary edema while the Mask R-CNN (baseline) did not identify any abnormality. In terms of the number of false negatives and false positives, the proposed MDF-Net with both fusion methods generated fewer of each for the majority of the chest abnormalities that it was trained on (Figure 2). We also performed ablation experiments to evaluate the effectiveness of our fusion methods and the impact of different sets of clinical features. The implementation of this work is open-sourced for further research and reproducibility ?. However, the dataset is restricted-access and requires users to fulfil PhysioNet's requirements to download and use it. To evaluate the effectiveness of each fusion method proposed in our MDF-Net, we created two Multimodal Single Fusion Networks (MSF-Net), which apply only 1-D or only 3-D fusion, to conduct ablation studies.

Impact of Different Backbones
In order to thoroughly evaluate the performance of our proposed MDF-Net model and to study its robustness across different architectures, we conducted an ablation study focusing on different backbones. Figure 3 presents the obtained results. Backbone architectures play a pivotal role in deep learning models as they are responsible for the feature extraction process. For this study, we incorporated well-known architectures including MobileNet, EfficientNet, ResNet, DenseNet, and ConvNeXt, which exhibit diverse architectural designs. We systematically analyzed and compared their performance when incorporated as the backbone of our MDF-Net model for different settings: baseline (image only), using 3D fusion only, and using both 3D and 1D fusion. This analysis aims to demonstrate the versatility of our proposed approach and to identify the backbone that can further enhance the performance of MDF-Net in disease localization tasks using chest X-rays and patients' clinical data. The ablation results indicate that MobileNet still outperforms all other backbones.

Impact of Different Fusion Methods
Fusion methods play an instrumental role in multimodal deep learning models, acting as a bridge that intertwines the information derived from different data modalities. To scrutinize the effectiveness and compatibility of various fusion methods within our proposed MDF-Net, we conducted an ablation study where we tested different fusion strategies. The strategies assessed included element-wise sum, concatenation followed by a linear operation, concatenation followed by a convolution operation, and the Hadamard product. Each of these methods amalgamates information in distinct ways, carrying unique assumptions about the interplay between the features derived from the image and clinical data. The results of our ablation study illustrate that the element-wise sum fusion method yielded the best performance within our MDF-Net model. This method, which combines features by adding them together element by element, appeared to be more effective at integrating the information from chest X-ray images and clinical data, thus improving the model's ability to accurately localize disease in chest X-rays. Figure 4 presents our results.
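The four fusion strategies compared above can be sketched on two feature maps of equal shape. This is a minimal NumPy illustration, not the released implementation: the random tensors stand in for the image and clinical feature maps, the "linear operation" is assumed to be a per-location linear map (equivalent to a 1x1 convolution), the convolution is assumed to be 3x3 with zero padding, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, W = 64, 16, 16                    # feature-map shape reported for the MobileNetv3 backbone
img = rng.standard_normal((D, H, W))    # stand-in for the image feature maps
clin = rng.standard_normal((D, H, W))   # stand-in for the clinical feature maps

def fuse_sum(a, b):
    # Element-wise sum: no extra parameters, output keeps the input shape.
    return a + b

def fuse_hadamard(a, b):
    # Hadamard (element-wise) product.
    return a * b

def fuse_concat_linear(a, b, weight):
    # Concatenate along channels, then apply a linear map at every location.
    # weight has shape (D, 2D); this reading of "linear operation" is an assumption.
    stacked = np.concatenate([a, b], axis=0)            # (2D, H, W)
    return np.einsum('oc,chw->ohw', weight, stacked)

def fuse_concat_conv(a, b, kernel):
    # Concatenate along channels, then apply a zero-padded 3x3 convolution.
    # kernel has shape (D, 2D, 3, 3).
    h, w = a.shape[1], a.shape[2]
    stacked = np.pad(np.concatenate([a, b], axis=0), ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = stacked[:, dy:dy + h, dx:dx + w]
            out += np.einsum('oc,chw->ohw', kernel[:, :, dy, dx], patch)
    return out
```

All four variants return a tensor of the original shape, so they can be swapped without changing the layers that follow; only the two concatenation variants introduce additional trainable weights.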

Impact of Different Clinical Features
We also investigated the impact of different sets of clinical features in the proposed MDF-Net architecture. We first used radiologists' expertise to understand how the different features affect each chest abnormality (Figure 5 panel a)). The radiologists agreed that features such as body temperature are highly relevant for the indication of certain diseases that provoke infections, such as consolidation or pleural abnormality. We also investigated whether this domain knowledge is present in the dataset by performing a correlation analysis between the abnormalities present in each patient and the corresponding clinical data (Figure 5 panel b)).

Discussion
The empirical evidence presented in the preceding section allows us to formulate the following findings related to the proposed MDF-Net: 1. MDF-Net outperforms the Mask R-CNN (Baseline) on overall AP and AR. MDF-Net has better AP on 4 out of 5 lesions and only has a small difference (-0.17% AP) from the baseline model in detecting Pleural Abnormality. In terms of overall performance, MDF-Net yields an improvement of +12% AP and +0.85% AR. From the findings above, we obtain experimental evidence that clinical data is of crucial importance and, thus, move away from the idea that images alone can support a reliable medical diagnosis.
Ablation Study for fusion methods. The Average Precision (AP) and Average Recall (AR) of each model are shown in Figure 2 panel c). Having demonstrated the effectiveness of MDF-Net, we use MSF-Net (1D) and MSF-Net (3D) to conduct an ablation study that tests the effectiveness of each fusion method. The results are as follows: 1. Figure 2 panel c) shows that MSF-Net (3D) has a larger improvement in performance (+7.96% AP, +6.47% AR) than MSF-Net (1D), which indicates that 3-D fusion is more effective than 1-D fusion. Compared to Mask R-CNN (Baseline), MSF-Net (3D) has better performance on 4 out of 5 lesions, which is the same as MDF-Net. We also noticed that this model has the highest AR of all models.
2. MSF-Net (1D) shows a slight improvement in AP (+2.58%) but a loss in AR (-4.508%), as shown in Figure 2 panel c).
In MSF-Net (1D), the clinical data are only concatenated with the flattened RoIs, which means that the clinical data are not involved in deciding where the Regions of Interest are (RPN output). In other words, if we split Mask R-CNN into two stages, the first stage determines the regions of interest (RoIs), and the second stage identifies the lesions inside those RoIs. In 1-D fusion, the clinical data is only perceived by the second stage, so it is not used for identifying RoIs. Consequently, 1-D fusion only helps the final classifier to filter out regions misclassified by the RPN; hence, AP increased while AR decreased.
3. When using both 1-D and 3-D fusions together, the 3-D fusion can help RPN to pick up abnormal regions better while 1-D fusion can help the final classifier to determine whether a lesion exists in the given region.
Ablation Study for Clinical Features. In order to understand the contribution and significance of clinical features, we also conducted an ablation study by giving different combinations of clinical features to MDF-Net. The performance of the different combinations is shown in Figure 5 panel c). When comparing the ablation result (Figure 5 panel c)) with the necessity table (Figure 5 panel a)) and the correlation matrix (Figure 5 panel b)), we found:
• In Figure 5 panel a), radiologists stated that heartrate is less important than temperature and resprate in diagnosing a pleural abnormality.
• In terms of diagnosing pulmonary edema, radiologists consider temperature less important than heartrate and resprate. The same pattern as for pleural abnormality is shown: heartrate and resprate have higher AP than temperature when age is not used.
• Considering the effect of the feature age, (gender, heartrate) is the only combination that has a slight performance drop (-0.298% AP) when age is introduced. Age improves the ability to detect Atelectasis but harms the performance of diagnosing Consolidation. In terms of overall performance, age brings an improvement in most of the combinations, which matches the pattern in Figure 5 panel b), where age has a higher correlation with most of the abnormalities.
• Moreover, in the same correlation matrix, we noticed that the correlation between Consolidation and resprate is higher than that with heartrate and o2sat, and the ablation results also show that resprate improves the models the most. Lastly, although resprate has a high correlation with enlarged cardiac silhouette, the feature heartrate seems more important in determining this abnormality.

Conclusion
In this paper, we proposed a novel multimodal deep learning architecture, MDF-Net, and two fusion methods for multimodal abnormality detection, which can perceive clinical data and CXR images simultaneously. In MDF-Net, a spatialisation module is introduced to transform 1-D clinical data to 3-D space, which allows us to predict proposals with multimodal data. In addition, 1-D fusion is used to provide clinical information to the final classifier in a residual manner. To test the performance of MDF-Net, we also propose a joining strategy to construct a multimodal dataset based on MIMIC-IV. The experiments show that MDF-Net consistently and considerably outperforms the Mask R-CNN (Baseline) model. Both fusion methods show significant improvements in Average Precision (AP), while applying them together achieves the best performance.
Overall, our MDF-Net improves upon the baseline Mask R-CNN by not only enhancing its ability to localize diseases in chest X-ray images but also extending its capabilities to incorporate vital clinical context, which is at the moment an important missing ingredient in the Deep Learning literature. This results in a more comprehensive and accurate diagnostic tool that can better support healthcare professionals and provide a better rationale for subsequent interpretations of the models.
In the future, we will explore the following two main directions. First, in this work, we could only retrieve 670 instances for our multimodal dataset, which is small compared to other popular datasets used for X-ray diagnosis. If larger datasets with clinical data become available in the future, our work should also be tested on them to obtain a more objective evaluation. Second, our fusion methods can also be applied to other models, such as YOLO ?, SSD ? and DETR ?. We will incorporate other architectures to evaluate the effectiveness of our fusion methods.

Methods
The methods employed in this research comprise a combination of image processing and machine learning techniques to achieve effective disease detection in chest X-ray images. We have chosen a robust and well-tested algorithm as our foundation and modified it to better suit our specific objectives. This has led to the development of an innovative model that takes advantage of the synergy between traditional image data and structured clinical data. This combined use of data sources significantly enhances the model's diagnostic performance by providing more contextual information for accurate disease localization. The following subsection provides a detailed explanation of the model architecture used in this study.

Model Architecture
In this paper, we propose an extension of Mask R-CNN, MDF-Net. We chose Mask R-CNN as our baseline model for the following two reasons. First, Mask R-CNN is a simple state-of-the-art model for object detection and instance segmentation, with proven and established success in localizing and classifying objects within images in a variety of contexts (see, for instance, ?, ?, ?, which testify to its success in a wide range of applications). Given our task of disease localization in chest X-ray images, Mask R-CNN's ability to provide both the class and location of disease indications made it a suitable choice. Second, Mask R-CNN's flexible and modular structure (that is, easy to train) allowed us to extend and modify it to better suit our specific task and incorporate our novel elements.
The main innovation in our MDF-Net is the dual-fusion architecture that allows for the integration of both image and structured clinical data (i.e., tabular data). This is a significant departure from traditional Mask R-CNN models, which primarily work with image data alone or combine image and text data. By integrating clinical data, our model is able to consider the additional context that is crucial for accurate disease localization, thus making it more aligned with the actual diagnostic process of radiologists. Our model also includes a novel spatialization strategy, which converts clinical data into a 'pseudo-image' format that can be processed by the same convolutional layers as the image data. This is a significant innovation that allows for a more seamless and effective integration of the two data types.

Input Layer. The proposed network receives as input two different modalities: a frontal (anterior-posterior) view of CXR images and the respective clinical data. They are defined as:
• a set of CXR images: $X^{CXR} \in \mathbb{R}^{W \times H \times C}$;
• a set of clinical data: numerical features $X^{C}_{num}$ and categorical features $X^{C}_{cat}$.
In the proposed architecture, we set the dimensions of our image input space to $W = H = 512$, $C = 1$ (since the CXR image is gray-scale). In terms of the clinical data, our input space corresponds to the number of features: $n_1 = 9$ (continuous variables) and $n_2 = 1$ (categorical variable: gender).
Since the input contains modalities with different dimensions, we first need to transform the data so that both modalities have the same dimensions before fusing them and attempting any localized object detection learning.
The goal of this step is to transform both the image input data and the clinical data to the same shape. To achieve this, we do the following: 1. Image transform: we extract feature maps from $X^{CXR}$ using a CNN backbone, $f_{CXR}$, which in this case corresponds to MobileNetv3 ?: $\mathcal{I} = f_{CXR}(X^{CXR})$, where the resulting dimensional space is $\mathcal{I} \in \mathbb{R}^{W' \times H' \times D'}$. In our implementation, MobileNetv3 produced feature maps with $W' = H' = 16$, $D' = 64$. We recognise that MobileNetv3 makes a significant reduction of our feature space; however, when we tried other CNN backbones such as ResNet, we obtained worse results and overfitting. MobileNetv3 provided the best results in our preliminary tests, hence our choice of this backbone.

Clinical data encoding:
In terms of the feature maps for clinical data, the goal is to concatenate both numerical and categorical feature representations. To do so, we applied an embedding layer to the categorical features, $f_{Emb}$, so as to obtain a latent vector with the same dimensions as the numerical data. Next, we concatenate the resulting latent vector $f_{Emb}(X^{C}_{cat})$ with $X^{C}_{num}$ as follows: $\mathcal{C} = f_{Emb}(X^{C}_{cat}) \cup X^{C}_{num}$, with $\mathcal{C} \in \mathbb{R}^{n}$, where in our implementation $n = 64$, $n_1$ corresponds to the number of continuous features ($n_1 = 9$) and $n_2$ to the number of categorical features after being processed by an embedding layer ($n_2 = 55$); $\cup$ is the vector concatenation operation. However, the dimensionality of $\mathcal{C}$ is different from that of the generated image feature maps, $\mathcal{I}$. This requires transforming the dimensions of the clinical data from $n \times 1$ to $W' \times H' \times D'$. To achieve this, we propose a method of spatialisation of the clinical data.
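The encoding step above can be sketched as a lookup into an embedding table followed by concatenation. This is a minimal sketch under stated assumptions: the table is randomly initialised here (it would be trained in practice), two gender codes are assumed, and the name `encode_clinical` is hypothetical.

```python
import numpy as np

N_NUM, N_EMB = 9, 55        # n1 and n2 from the text, so n = n1 + n2 = 64
rng = np.random.default_rng(0)
# Hypothetical embedding table: one 55-dimensional row per gender code (two codes assumed).
embedding = rng.standard_normal((2, N_EMB))

def encode_clinical(numeric, gender_code):
    """Concatenate the embedded categorical feature with the numerical features."""
    numeric = np.asarray(numeric, dtype=float)
    if numeric.shape != (N_NUM,):
        raise ValueError("expected 9 continuous clinical features")
    # Embedding first, numerical features second, following the order in the text.
    return np.concatenate([embedding[gender_code], numeric])
```

For example, `encode_clinical(np.zeros(9), 1)` returns a vector of length 64 whose first 55 entries are the embedding row for code 1.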

Spatialisation:
We define a spatialisation layer, $f_{spa}$, as a deconvolutional layer ? followed by a convolution operation. The deconvolution takes as input the $n \times 1$ dimensional clinical data vector and learns an upscaled representation using a sparse encoded convolution kernel ?, ?. This is given by $\mathcal{S} = f_{spa}(\mathcal{C})$, where $\mathcal{C}$ is the encoded clinical data vector and $e$ is the number of spatialised layers (in this work, we set $e = 9$). After applying spatialisation, the size of $\mathcal{S}$ becomes $W \times H \times C$, which is the size of the input image $X^{CXR}$.
The primary purpose of using deconvolution, also known as transposed convolution, in this context, is to facilitate the upsampling of clinical data to a dimensionality that aligns with the input chest X-ray images.Achieving parity of dimensions between these two data types is crucial as it enables us to acquire feature maps of the same size from both datasets, thereby paving the way for a more effective fusion operation in the subsequent stages of our model.
One of the significant benefits of deconvolution layers within this process is that they render the upsampling procedure trainable.In effect, this means that our model can learn the most efficient transformation strategy for converting the clinical data into a spatial 'pseudo-image', thereby enhancing its ability to integrate with the chest X-ray images and learn from the data concurrently.
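The upsampling idea can be sketched with a stack of transposed convolutions. The sketch below assumes that each of the e = 9 layers doubles the spatial size (kernel 2, stride 2), so that a vector placed on a 1x1 grid grows to 2^9 = 512 pixels per side; this factor-of-two reading, the intermediate channel widths, and the omission of the trailing convolution and of any non-linearity are assumptions for illustration, not details confirmed by the text.

```python
import numpy as np

def deconv2x(x, w):
    """Transposed convolution with kernel 2 and stride 2 (no overlap between outputs).

    x: (C_in, H, W), w: (C_in, C_out, 2, 2) -> (C_out, 2H, 2W)
    """
    c_out, h, wd = w.shape[1], x.shape[1], x.shape[2]
    # out[o, i, a, j, b] = sum_c x[c, i, j] * w[c, o, a, b]
    out = np.einsum('cij,coab->oiajb', x, w)
    # Merging (i, a) and (j, b) places each 2x2 block at rows 2i+a, columns 2j+b.
    return out.reshape(c_out, 2 * h, 2 * wd)

def spatialise(vec, weights):
    """Treat the clinical vector as n channels on a 1x1 grid and upsample layer by layer."""
    x = np.asarray(vec, dtype=float).reshape(-1, 1, 1)
    for w in weights:
        x = deconv2x(x, w)
    return x

rng = np.random.default_rng(0)
channels = [64, 8, 8, 8, 8, 8, 8, 8, 8, 1]   # hypothetical widths; nine layers in total
weights = [rng.standard_normal((ci, co, 2, 2)) * 0.1
           for ci, co in zip(channels, channels[1:])]
pseudo_image = spatialise(rng.standard_normal(64), weights)   # shape (1, 512, 512)
```

Because the kernels are ordinary weight tensors, a framework with automatic differentiation can train them end to end, which is the property the paragraph above highlights.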

Clinical data transform:
The last step for clinical data is to extract the feature maps from the spatialised clinical data $\mathcal{S}$. By applying a CNN $f_{clinical}$, which uses the same architecture as $f_{CXR}$, we obtain the clinical feature maps $f_{clinical}(\mathcal{S})$. In the end, with the proposed spatialisation operation $f_{spa}$ and the following CNN $f_{clinical}$, the resulting feature maps lie in $\mathbb{R}^{W' \times H' \times D'}$, which matches the dimensions of $\mathcal{I}$. From this step, one can proceed to the fusion of both modalities.
The final feature map $Z$ representing the element-wise sum fusion of both modalities is obtained by $Z = f_{fused}(\mathcal{I} \oplus f_{clinical}(\mathcal{S}))$, where $f_{fused}$ is another CNN module used for obtaining the feature maps from the fused modalities, and $\oplus$ corresponds to the element-wise sum operation that is used for fusion. The final $Z$ corresponds to the 3D feature map representation of the combined patient information. Next, we use this data representation as input to a Mask R-CNN architecture to perform abnormality detection.
Region Proposal Network: To perform localized abnormality detection, we use the Region Proposal Network, $f_{rpn}$, of the Mask R-CNN architecture to generate candidate object bounding boxes, also known as proposals $P$, given by $P = f_{rpn}(Z)$ (Equation 3). The RPN learns the coordinates of the generated bounding boxes $(x_i, y_i, w_i, h_i)$ and the corresponding confidence score, $c_{obj}$, of having an abnormality (object) at the location of each bounding box. This confidence score is used to sort the generated proposals by their predictive relevance.
Using the coordinates of the computed bounding boxes, a RoIPool operation is performed to extract the corresponding Regions of Interest (RoIs), $Z_{RoI} = \mathrm{RoIPool}(P, Z)$. The RoIs result in a data structure with dimensions $Z_{RoI} \in \mathbb{R}^{W_r \times H_r}$, where $W_r$ and $H_r$ are hyper-parameters. In our experiments, we set $W_r$ and $H_r$ to 7.
After learning the candidate RoIs, we flatten this data to serve as input to a standard dense neural network, which performs the final classification. In order to emphasize the role of the clinical data in this classification process, we concatenate the clinical data representation, $Z_{C}$, with the flattened candidate RoIs, $Z_{RoI}$, before classification takes place. The role of the 1-D fusion in MDF-Net is to provide residual information that passes the clinical data to deeper layers in our architecture. The final prediction $\hat{y}$ is then obtained by $\hat{y} = f_{cls}(Z_{RoI} \cup Z_{C})$, where $\cup$ represents the vector concatenation operation, $Z_{RoI} = \mathrm{RoIPool}(P, Z)$, $\hat{y}$ contains the predicted classes $\hat{y}_{cls}$, bounding boxes $\hat{y}_{bb}$, and binary masks $\hat{y}_{mask}$, and $f_{cls}$ is the final classification layer.
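The 1-D fusion step can be sketched as a concatenation followed by a dense layer. This is a minimal illustration: a single linear layer stands in for the classification head, and the channel count of the pooled RoI, the length of the clinical vector, and the number of outputs are assumptions chosen for the example.

```python
import numpy as np

def one_d_fusion(roi, clinical, weight, bias):
    """Concatenate a flattened RoI with the clinical vector and apply one dense layer."""
    joint = np.concatenate([roi.reshape(-1), clinical])   # flattened RoI followed by clinical data
    return weight @ joint + bias                           # one score per output class

rng = np.random.default_rng(0)
roi = rng.standard_normal((64, 7, 7))     # assumed: 64 channels, pooled to 7x7
clinical = rng.standard_normal(64)        # assumed: 64-dimensional clinical representation
n_out = 5                                 # the five abnormality classes (background handling omitted)
weight = rng.standard_normal((n_out, roi.size + clinical.size)) * 0.01
bias = np.zeros(n_out)
scores = one_d_fusion(roi, clinical, weight, bias)
```

Since the clinical vector enters only after RoI pooling, it can change how a proposed region is scored but not which regions are proposed.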

Number of Classes.
In this study, we perform object detection over five different classes: Enlarged Cardiac Silhouette, Atelectasis, Consolidation, Pleural Abnormality, and Pulmonary Edema. We chose to focus on five classes because they were the most representative in our dataset. We also took into account the constraints imposed by the dataset's class imbalance. The classes we have not included had insufficient examples, which would have posed significant challenges in training and evaluating our deep learning model. Training a deep learning model with insufficient data for some classes could lead to overfitting and poor generalization performance for those classes. Moreover, it could bias the model towards the classes with more data. Therefore, to ensure a reliable and robust model, we chose to focus on the five most representative classes.

Model Complexity Analysis
The overall computational complexity of the proposed architecture approximates that of the original Mask R-CNN model, which corresponds to the sum of the complexities of its components. The relevant quantities are N, the number of region proposals; C, the number of classes (five, in our case); H, the feature map height; and W, the feature map width. Faster R-CNN is composed of 5 main parts as follows: (1) a deep fully convolutional network, (2) region proposal network, (3) RoI pooling and fully connected networks, (4) bounding box regressor, and (5) classifier.
The deep fully convolutional network consists of five convolution layers, based on Zeiler and Fergus's fast (smaller) model ?. Given an image I, this step extracts 256 × N × N feature maps (given the 5 convolution layers ?). This is the input of the RPN and the RoI pooling layer. In the RPN, for each point of the feature map, there are, say, K anchors (or candidate windows, typically 2000) with different scales and ratios. Thus, there will be a total of N × N × K candidate windows. Non-maximum suppression typically reduces these to 2000 ? candidate windows. This yields a complexity of $O(N^2)$.
Using the candidate windows and the feature map above, the RoI pooling layer divides the variable-size candidate windows into an H × W grid of sub-windows and then max-pools the values in each sub-window into the corresponding output grid cell. The complexity of this process is O(1).
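The RoI pooling operation described above can be sketched for a single-channel feature map. This is a simplified illustration with integer cell boundaries; the function name and the exclusive `(x1, y1, x2, y2)` box convention are assumptions, and production implementations handle quantisation in more detail.

```python
import numpy as np

def roi_max_pool(feature, box, out_h=7, out_w=7):
    """Max-pool one candidate window of a 2-D feature map into an out_h x out_w grid.

    feature: (H, W) array; box: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box
    region = feature[y1:y2, x1:x2]
    # Integer boundaries that split the window into a fixed number of sub-windows.
    ys = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every sub-window covers at least one cell, even for very small windows.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()
    return out
```

The output grid has a fixed size regardless of the window size, which is why the windows can be flattened and passed to fully connected layers afterwards.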

Training
Once the model architecture has been established, the next crucial step involves training the model using a carefully designed loss function to achieve optimal performance. In the context of Mask R-CNN, the loss function plays a pivotal role in learning the optimal parameters of the model. A well-constructed loss function balances multiple objectives, including accurate classification of abnormalities, precise bounding box regression, and efficient object proposal.
In our training process, we incorporate five loss terms, each addressing a specific aspect of the model's learning objectives. These loss terms aim to guide the model toward achieving high precision in identifying and localizing diseases in chest X-ray images. In the following, we provide a detailed explanation of each of these loss terms.
• $L_{cls}$: Cross-entropy between the ground-truth abnormality $y_{cls}$ and the predicted abnormality $\hat{y}_{cls}$. This loss term requires the model to predict the class of abnormalities correctly in the output layer.
• $L_{bb}$: Bounding box regression loss between the ground-truth bounding boxes $y_{bb}$ and the predicted bounding boxes $\hat{y}_{bb}$, calculated using the smooth-$L_1$ norm, where $\beta$ is a hyperparameter. In our implementation, $\beta = 1/9$. To minimise this loss, the model has to locate abnormalities in the correct areas in the output layer.
• $L_{mask}$: Binary cross-entropy loss between the ground-truth segmentation $y_{mask}$ and the predicted masks $\hat{y}_{mask}$, which requires the model to locate abnormalities at the pixel level.
• $L^{rpn}_{obj}$: Binary cross-entropy loss between the ground-truth objectness $y_{obj}$ and the predicted objectness $c_{obj}$ (confidence score), which requires the RPN to correctly classify whether the proposals (candidate bounding boxes) contain any abnormality.
• $L^{rpn}_{bb}$: Proposal regression loss between the proposals (candidate bounding boxes) $p \in P$ and the ground-truth bounding boxes $y_{bb}$, calculated using the same smooth-$L_1$ norm function as $L_{bb}$. This loss term aims to improve the RPN at localising abnormalities.
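As an illustration of the smooth-L1 norm used by the two regression losses, the sketch below implements the commonly used β-parameterised form (quadratic for absolute errors below β, linear otherwise), as found for instance in PyTorch's `smooth_l1_loss`. Treating this as the form used here is an assumption, as are the function name and the mean reduction over coordinates.

```python
def smooth_l1(pred, target, beta=1.0 / 9.0):
    """Mean smooth-L1 loss over paired box coordinates (assumed beta-parameterised form).

    Per coordinate: 0.5 * d^2 / beta  if d < beta,  else  d - 0.5 * beta,
    where d = |pred - target|.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < beta:
            total += 0.5 * d * d / beta  # quadratic region near zero error
        else:
            total += d - 0.5 * beta      # linear region, robust to outliers
    return total / len(pred)
```

With a small β such as 1/9, the loss behaves like an L1 loss for all but very small errors, which makes the regression less sensitive to outlier boxes than a squared error.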
We used homoscedastic (task) uncertainty ? to train the proposed model with these five loss terms, dynamically weighting them for better convergence. Let $L = \{L_{cls}, L_{bb}, L_{mask}, L^{rpn}_{obj}, L^{rpn}_{bb}\}$. We used SGD (stochastic gradient descent) to optimise the overall loss function, where $\theta$ denotes the weights of MDF-Net and $\alpha_l$ is a trainable parameter that weighs each task/loss.
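To make the dynamic weighting concrete, the sketch below uses one widely used formulation of homoscedastic uncertainty weighting, in which each loss is scaled by exp(-α_l) and α_l is added as a regulariser. This particular form is an assumption and may differ from the overall loss used in this work; the names `LOSS_NAMES` and `uncertainty_weighted_loss` are hypothetical. In training, the α_l values would be updated by SGD together with the network weights, whereas the sketch only evaluates the combined loss.

```python
import math

# Hypothetical keys for the five loss terms described above.
LOSS_NAMES = ("cls", "bb", "mask", "obj_rpn", "bb_rpn")


def uncertainty_weighted_loss(losses, alphas):
    """Combine task losses as sum_l exp(-alpha_l) * L_l + alpha_l (assumed form).

    losses: dict mapping loss name to its current value.
    alphas: dict mapping loss name to its trainable weight parameter
            (interpreted here as a log-variance).
    """
    return sum(math.exp(-alphas[n]) * losses[n] + alphas[n] for n in LOSS_NAMES)


# With all alphas at zero, the combined loss is the plain sum of the five terms.
example_losses = {"cls": 0.7, "bb": 0.2, "mask": 0.4, "obj_rpn": 0.1, "bb_rpn": 0.05}
example_alphas = {name: 0.0 for name in LOSS_NAMES}
example_total = uncertainty_weighted_loss(example_losses, example_alphas)
```

Under this formulation, a larger α_l down-weights a noisy task, while the additive α_l term prevents the trivial solution of ignoring every task.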
The validation of this architecture required a multimodal dataset. In this study, we used medical data, more specifically chest X-ray images from MIMIC-CXR ? and patients' clinical data from MIMIC-IV-ED ?. However, these datasets are offered separately in the literature, and a thorough data integration had to be conducted before evaluating our architecture.

Dataset
Modern medical datasets integrate both imaging and tabular data. The latter refers to medical history and lifestyle questionnaires, and clinicians have the responsibility to combine the two sources of information. Note also that, beyond diagnosis, multimodal data (i.e., comprising tabular and image data) is crucial to advancing the understanding of diseases, motivating the creation of so-called biobanks. Examples include the German National Cohort ? and the UK Biobank ?, which includes thousands of data fields covering patient questionnaires, physical measures, sample assays, accelerometry, multimodal imaging, and genome-wide genotyping. However, these datasets do not contain local image annotations of lesions. Therefore, we propose a strategy to combine MIMIC-IV ? and REFLACX ? to create our dataset, MIMIC-Eye, from scratch; it meets the requirements of this work and can be accessed on PhysioNet ?. The Medical Information Mart for Intensive Care (MIMIC) IV dataset is drawn from two in-hospital database systems, a custom hospital-wide EHR and an ICU-specific clinical information system, at Beth Israel Deaconess Medical Center (BIDMC) between 2011 and 2019. The MIMIC-IV database is grouped into three modules: core, hosp, and icu. In this work, only the patients' data in the core module is used.
In addition to the MIMIC-IV dataset, two other MIMIC-IV subsets, MIMIC-IV ED (Emergency Department) ? and MIMIC-IV CXR (Chest X-ray) ?, are used to create the multimodal dataset. These two datasets can be linked to the MIMIC-IV dataset with subject_id and stay_id. The MIMIC-IV ED dataset was extracted from the emergency department at the Beth Israel Deaconess Medical Center. It contains data for emergency department patients collected while they are in the ED. The triage data of the MIMIC-IV ED dataset is one source of information on patients' health condition in this work, such as temperature, heart rate, and respiratory rate. MIMIC-CXR is another subset of MIMIC-IV, consisting of 227,835 radiographic studies and 377,110 radiographs from the BIDMC EHR between 2011 and 2016. In the original MIMIC-CXR dataset, the CXR images are provided in DICOM format, which allows radiologists to adjust the exposure during reading. However, to train a machine learning model, the JPG format is preferred. The authors of MIMIC-IV CXR therefore presented the MIMIC-CXR JPG dataset ? to facilitate the training process. The REFLACX dataset is another subset of MIMIC-IV ED, which provides extra data from different modalities, such as eye-tracking data, bounding boxes, and time-stamped utterances. The bounding boxes in REFLACX are used as ground truth in this work.
In total, ten clinical features are used in this work. The MIMIC-IV Core patients data includes only two clinical attributes, age and gender. The other eight clinical features are extracted from the MIMIC-IV ED triage data. The explanations for these eight clinical features in the MIMIC-IV documentation are:
1. temperature: The patient's temperature in degrees Fahrenheit.
2. heartrate: The patient's heart rate in beats per minute.

6. pain: The level of pain self-reported by the patient, on a scale of 0-10.
7. acuity: An order of priority. Level 1 is the highest priority, while level 5 is the lowest priority.
Before explaining the creation process, it is necessary to introduce some important IDs and data tables in the MIMIC-IV dataset. Four important IDs are used in MIMIC-IV to link the information across tables. They are:
• subject_id (patient_id): ID specifying an individual patient.
• stay_id: ID specifying a single emergency department stay for a patient.
• study_id: ID specifying a radiology report written for the given chest X-ray. It is rarely mentioned because we do not use the report as the ground-truth label in this paper.
The following four tables in MIMIC-IV are used to create our multimodal abnormality detection dataset:
• MIMIC-IV Core patients: Information that is consistent for the lifetime of a patient is stored in this table, including age and gender.
• MIMIC-IV ED triage: This table contains information about the patient when they were first triaged in the emergency department, including temperature, heart rate and more clinical data.
• MIMIC-IV Core edstays: Provides the time the patient entered the emergency department and the time they left the emergency department, which helps us to identify the stay_id for CXR images.
• MIMIC-IV CXR metadata: Contains the information about the CXR image (radiograph), including the time taken, height and width.
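The way the edstays table helps identify the stay_id of a radiograph can be sketched as a time-window lookup: a CXR is assigned to the emergency department stay of the same patient whose entry and exit times enclose the time the image was taken. This is a minimal sketch under stated assumptions, not the authors' integration pipeline. The record layout is simplified, `find_stay_id` is a hypothetical helper, `study_time` is a hypothetical pre-parsed timestamp field, and `intime` and `outtime` are assumed names for the entry and exit times.

```python
from datetime import datetime


def find_stay_id(cxr, edstays):
    """Return the stay_id of the ED stay during which the radiograph was taken.

    cxr: dict with 'subject_id' and 'study_time' (hypothetical field names).
    edstays: list of dicts with 'subject_id', 'stay_id', 'intime', 'outtime'.
    Returns None when no stay of that patient encloses the study time.
    """
    for stay in edstays:
        same_patient = stay["subject_id"] == cxr["subject_id"]
        if same_patient and stay["intime"] <= cxr["study_time"] <= stay["outtime"]:
            return stay["stay_id"]
    return None


# Toy records for illustration only; these values are made up.
toy_edstays = [
    {"subject_id": 1, "stay_id": 101,
     "intime": datetime(2015, 3, 1, 8, 0), "outtime": datetime(2015, 3, 1, 16, 0)},
    {"subject_id": 1, "stay_id": 102,
     "intime": datetime(2016, 7, 9, 22, 0), "outtime": datetime(2016, 7, 10, 5, 0)},
]
toy_cxr = {"subject_id": 1, "study_time": datetime(2016, 7, 10, 1, 30)}
toy_stay_id = find_stay_id(toy_cxr, toy_edstays)
```

Once the stay_id is known, the triage record of that stay can be joined to the image, and the patient-level attributes (age and gender) follow from subject_id.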

Limitations and Ethical Considerations
Although our study presents promising results, it has inherent limitations, primarily revolving around data selection, fusion methods, and ethical considerations:
Dataset Choice. The effectiveness of our proposed model relies heavily on the quality and comprehensiveness of the input information, both images and clinical data. While using publicly available and well-established datasets such as MIMIC-CXR, MIMIC IV-ED, and REFLACX minimizes the risk of data quality issues, these datasets may only partially represent diverse global populations. The performance of our model could vary when applied to different demographic groups, and potential biases in the datasets could influence the results.
Dataset Size. Although many public datasets contain both CXR images and manual lesion annotations, to the best of our knowledge, none of them also contains the patients' clinical data. Patient clinical data are sensitive and protected by strict privacy regulations, so most publicly available medical image datasets do not include this kind of information. As a result, researchers often face significant challenges in obtaining datasets that combine imaging data with relevant clinical information. This scarcity limits the effectiveness of our MDF-Net and its ability to generalize.
Limitations in Fusion Methods. For our study, we considered other fusion methods, such as the Laplacian pyramid and adaptive sparse representation ?, for the 3D fusion component in the proposed MDF-Net. However, these methods are not differentiable, which makes them incompatible with our end-to-end deep learning architecture. The backpropagation process used to train deep learning models requires the gradients (derivatives) of the loss with respect to the model parameters. Non-differentiable operations disrupt this gradient flow, which could lead to suboptimal or untrainable models.
While the development and application of multimodal deep learning technologies have the potential to enhance disease diagnosis significantly, several ethical considerations must be addressed to ensure that such technologies are used responsibly and effectively.
Impact on Healthcare Professionals. While applying deep learning technologies may streamline diagnostic procedures and alleviate the workload of healthcare professionals, we must consider the potential impact on their roles and responsibilities. In this study, we align our work with the perspective that these technologies are tools that can support, not replace, healthcare professionals. Our interviews with radiologists highlighted the importance of integrating clinical data in the image diagnosis process, emphasizing the continued need for expert knowledge and for more human-centred deep learning architectures such as the one proposed in this paper.
Impact on Trust. The black-box nature of deep learning models raises concerns about transparency and trust in AI decisions. Our multimodal DL architecture seeks to improve the interpretability of predictions by using clinical information alongside image data, providing a context for the decisions made by the model. Further development of these technologies must emphasize explainability, so that healthcare professionals and patients can understand and trust the diagnoses provided by these models. By localising the predicted lesions, we already provide one layer of interpretability. For future work, we are developing methods to translate the symbolic representation of identified lesion bounding boxes into human-level explanations incorporating domain knowledge.
Biases. Potential biases in the model's predictions, resulting from biased or unrepresentative training data, can lead to disparities in healthcare outcomes. Care must be taken to ensure that datasets used to train such models are representative of the diverse patient populations they will serve. In cases where data is imbalanced, techniques such as oversampling, undersampling, or synthetically augmenting the minority class can be used to address this issue. However, this may reinforce potential selection biases in the dataset. For this reason, we restricted our study to the most frequent classes in the dataset (ending up with a much smaller dataset that impacted our model's performance), rather than introducing selection and sampling biases from data augmentation techniques.

Figure 1. Panel a) An overview of the proposed MDF-Net architecture. Panel b) Integration of the CXR images from MIMIC-CXR ? with the clinical data of MIMIC IV ?. Panel c) IoBB is used to evaluate the model between ground truth and predictions.
panel c) presents the performance of Mask R-CNN (Baseline), MDF-Net and the two MSF-Nets. Figure 6 panel b) presents the evaluation results across different IoBB thresholds.
panel b)). Finally, we applied the proposed MDF-Net and tested it on different sets of clinical features to investigate their impact on the learning model (Figure 5 panel c)).

2. Figure 6 panel a) shows the model performance on the training set (red) and test set (green). Although both models reached similar performance on the training set, MDF-Net generalised better to the test set compared to Mask R-CNN (Baseline).
3. In Figure 6 panel b), MDF-Net (blue) and the other two MSF-Nets outperform Mask R-CNN (purple) on almost all IoBB thresholds. While the Mask R-CNN (Baseline) suffers at higher IoBB thresholds, MDF-Net still maintains a reasonable performance, which indicates that MDF-Net can better locate lesions regardless of the IoBB threshold.

Figure 7 panel a) shows the architecture of the original Mask R-CNN, which is the baseline model used in this work. The backbone can be any neural network that can extract feature maps from an image. The architecture of MDF-Net is shown in Figure 7 panel b). Considering the size of the dataset, a small pre-trained backbone model, MobileNetV3 ?, is used in both Mask R-CNN (baseline) and MDF-Net to prevent overfitting.

Figure 3. Ablation study results for different backbone architectures in the MDF-Net. This table provides the average precision (AP) and average recall (AR) values obtained using MobileNet, EfficientNet, ResNet, DenseNet, and ConvNextNet, as well as the overall number of True Positives (TP), False Positives (FP), and False Negatives (FN).

Figure 6. Panel a) Model generalisation analysis. This graph shows that clinical data and MDF-Net improve the generalisation ability on the test set. Panel b) Average precision analysis for different IoBB thresholds. This chart shows that MDF-Net reached the best performance when using both fusion methods. Using either the 1-D or the 3-D fusion method alone can also improve the model across all IoBB thresholds.
c), we found $AP_{gender,heartrate} < AP_{gender,resprate} < AP_{gender,temperature}$, which follows the importance shown in the necessity table (Figure 5 panel a)). However, when we introduce the age feature, we obtain the following inequalities: $AP_{gender,age,temp} < AP_{gender,age,heartrate} < AP_{gender,age,resprate}$, which indicates that temperature and resprate did not bring more improvement to the model compared to heartrate when age is also used.