Multimodal data fusion using sparse canonical correlation analysis and cooperative learning: a COVID-19 cohort study

Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we aim to present our approach for analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients: Intensive care unit admission. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel (cor(Xu1, Zv1) = 0.596, p value < 0.001). Among radiomics features, histogram-based first-order features reporting the skewness, kurtosis, and uniformity have the lowest negative, whereas entropy-related features have the highest positive coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also allows the preservation of phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task. The model yields area under the curve (AUC) and accuracy values of 0.87 and 0.77, respectively. Our study illustrates that sparse CCA analysis and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.

The number of patients who underwent chest CT at least once was 127 (Supplementary Fig. 1).In 105 isolated viral genome samples, the number of unique nucleotide mutations was 710, and the number of unique amino acid mutations was 439.The median of nucleotide mutations per strain was 29.0 (21.0-33.0),and the median of amino acid mutations was 22.5 (12.0-27.8)(Supplementary Fig. 2).Fifty-two strains (49.5%) were assigned to Variant 20I (Alpha, V1) according to the Nextclade clades (Fig. 1).

Visualization of the global SARS-CoV-2 strains using Word2Vec embedding
We visualized 300 randomly selected viral strains from each Nextclade clade in the corpus, generated with global viral genome sequences on the GISAID database.This shows that major variants, such as Variants 20I (Alpha, V1), 20H (Beta, V2), 21I, and 21J (Delta's), and Omicron clades, were successfully separated (Fig. 2).Not only separation but also some of the characteristic features seen in phylogenetic relationships were observed in the embedding space.While more ancestral clades, for instance, Variants 19A, 19B, and 20A, representing the early days of the pandemic, had a wider distribution, clades dating to later periods tend to be observed within clusters.Furthermore, clades that have closer evolutionary relationships, for example, Variants 20H (Beta, V2), 21I (Delta), and 21H (Mu) were located closer in the embedding space 26 .On the other hand, Omicron variants were separated from these groups, as highlighted in the literature 27 .It was also seen that some recombinant strains were located close to the Omicron variants, while others tended to spread toward other variants.
Unsupervised pairwise data fusion using sparse CCA Next, we performed sparse CCA analysis to examine the pairwise associations between all data modalities (Table 2).Relevant sparsity parameters corresponding to the highest Z-stat score were determined with the number of non-zero weights of X and Z.We first report the results for combining laboratory results and radiomics features for 127 patients (cor(Xu 1 , Zv 1 ) = 0.596,.In the laboratory results group, lactate dehydrogenase (LDH), which relates to disease progression and worse outcome, had the highest coefficient value (0.47), followed by erythrocyte sedimentation rate, Ddimer, polymorphonuclear leukocytes, white blood cell count, and acute phase reactants such as C-reactive protein (CRP) and fibrinogen 28 .At the same time, albumin had the lowest coefficient value (−0.46) as a negative acute phase reactant, along with hemoglobin, lymphocyte, and sodium levels.In the radiomics features group, original and wavelet features in the LLL frequency channel had the highest absolute values of coefficients.Among them, histogram-based first-order features reporting the skewness, kurtosis, and uniformity had the lowest negative coefficients, whereas entropy-related features had the highest positive coefficients.The negative coefficients of skewness and kurtosis features indicated that, as laboratory results worsened, since image density is measured approximately as −1000 Hounsfield units (HU) for air, and 0 to +70 HU for various tissue types such as blood, pleural effusion, abscess, and mucus, depending on the extent and content of the lung abnormalities, the image intensity histogram became flatter and more right-skewed 29,30 .This led to a loss of homogeneity and increased entropy in the radiomics features (Fig. 3).Sparse CCA analysis of laboratory results and clinical data revealed different clinical phenotypes.(cor(Xu 1 , Zv 1 ) = 0.63, best L1 bound for X and Z: 0.7, p = 0.04) (Fig. 4).While laboratory results' first canonical coefficients revealed a phenotype related to high creatinine levels with low albumin levels and anemia, on the clinical side, the patient appeared to be elderly and multi-morbid with moderate to severe renal disease.The second canonical variables represented a different patient phenotype who was young and immunocompromised with high ferritin levels.The third canonical variables also characterized a phenotype that likely presents a patient with a history of liver disease with elevated INR, total bilirubin levels, and hypoalbuminemia.
Next, we focused on the sparse CCA analysis of imaging and clinical data (Supplementary Fig. 3).In this experiment, the permutation-based approach for choosing parameters provided a sparser solution than previous ones; the correlation coefficient was 0.65 (p < 0.01).The first and third canonical vectors were found to be associated with sex, and the second and fourth ones were with lymphoma and CCI, respectively.Mainly, the first sex-related canonical vector coefficients belonged to the left, and the second sex-related canonical vector coefficients belonged to the right lung and consisted of first-order and shape radiomics classes.In combinations of pairwise sparse CCA analysis of viral genome sequencing data with other data modalities, different encoding techniques for the viral genome altered the correlation plots, with better separation obtained with Viral-Word2Vec encoding (Supplementary Fig. 4).
Finally, we performed sparse multi-CCA on patients with all data modalities (n = 89).When Viral-Binary encoding was used for viral encoding, the highest Z-score was 2.15 (penalties = 12.22, 6.64, 1.56, 1.7, p < 0.01).The highest Z-score was 3.14 (penalties = 17.923, 8.468, 2.293, 2.493, p < 0.01) when Viral-Word2Vec was used for viral encoding.We  showed that with the Viral-Word2Vec embedding, the viral genome projection of the patients was more homogeneously distributed, and pairwise correlation values of viral features and other data modalities were reduced (Fig. 5).While the dominance of histogram-based first-order radiomics features and the importance of albumin and LDH among the laboratory results persisted in the analysis of these 89 patients, clinical data's first canonical vectors slightly revealed a different clinical phenotype which was an elderly multimorbid patient with dementia (Supplementary Fig. 5).

Supervised multiview analysis using cooperative learning
We performed cooperative learning to build prediction models for ICU admission.First, we used all five data modalities to train unimodal prediction models in all patients.The highest score was achieved with the model using radiomics features (AUC = 0.83 ± 0.01), followed by laboratory results and clinical data (AUC = 0.77 ± 0.02 vs. 0.67 ± 0.02), respectively (Supplementary Table 1 and Supplementary Fig. 6).
Then, the models were trained with all available singular, dual, and quad data combinations and evaluated for the same task in 89 patients.The best accuracy and AUC values were achieved with the quadruple model, which combines clinical, laboratory, radiomics, and viral genome sequencing data using Viral-Word2Vec encoding (CLRW) (Supplementary Table 2 and Fig. 6).While unimodal prediction models of radiomics features and laboratory results had a mean AUC score of 0.83, the quadruple prediction model using Viral-Word2Vec encoding achieved a mean AUC score of 0.87.The mean AUC score for the quadruple model using Viral-Binary encoding was 0.83 ± 0.2.There was a statistical difference between these two quadruple models (p < 0.001).The model that combines radiomics and Viral-Word2Vec encoding also outperformed the model that combines radiomics and Viral-Binary encoding in terms of AUC scores (0.86 vs. 0.83, p < 0.001).There was no statistical difference when we used different viral embedding techniques in other models with dual data combinations.
To understand which features were the most important for the performance of the CLRW model, we used all the available data as the training set.We performed a 5-fold CV to get the optimal hyperparameters and found Alpha and rho as 0.2 and 0.1, respectively.Model standardized coefficients were extracted at λ = 0.1 (Supplementary Fig. 7).Again, like the results of the unsupervised sparse CCC analysis of radiomics and laboratory results, the original and wavelet features in the LLL frequency channel had the highest absolute values for the standardized coefficients.LDH, ESR, CRP, albumin, and total bilirubin were the selected serum biomarkers, and age, chronic disease, and CCI were the clinical variables of the CLRW model.Word2Vec encoding also contributed to the supervised task with its 4 out of 300 dimensions.

Discussion
The complexities of modern biomedical challenges demand more holistic approaches to data analysis.This study spotlights the potential of multi-modal data fusion, harnessing diverse data modalities to derive deeper insights into health phenomena.By integrating disparate data types, we can uncover nuanced patterns and relationships that single-modal analyses might miss.In particular, the potential of integrating viral genome sequencing, imaging, clinical data, and laboratory results is immense.Sparse CCA analysis and cooperative learning, as highlighted in our study, are instrumental in combining these data strands, offering a multi-faceted view of a complex disease such as COVID-19.Our exploration into using the Word2Vec NLP model for viral embedding further underscores the value of innovative techniques in transforming raw data into meaningful representations, especially in the context of viral genomic sequencing data.
Our approach has precedence.Using NLP techniques, particularly the Word2Vec model for viral embedding, has been recognized in prior research.But while earlier studies were often unimodal and focused on tasks like viral classification or evolution tracking, our method differs by integrating this with other data modalities, offering a more comprehensive view [31][32][33][34] .The merit of this method is evident in its ability to encapsulate the cumulative effects of multiple viral mutations and their relationships, a task that single-modal approaches might find challenging.For example, we illustrated that Word2Vec encoding not only separates major SARS-CoV-2 variants but also allows the preservation of phylogenetic relationships among them.We also found that Word2Vec encoding is beneficial in showing which groups recombinant strains are close to, which might be challenging to represent in a phylogenetic tree.
Next, imaging data, particularly CT scans, holds potentially more information than can be observed by radiologists.Radiomics offers a quantitative approach to interpreting this data, allowing correlations with clinical features and laboratory results.Previous literature has highlighted the correlation between radiological findings and other biomarkers [35][36][37] .Our analysis shows that serum biomarkers that represent positive and negative acute phase responses are mainly found to be correlated with radiomics features related to the distribution of voxel intensities.As known, in lung nodule and cancer studies, mainly shape-related features are robust and provide information about disease phenotypes and prognosis; however, our results emphasize the importance of histogram-and entropy-related features because a higher proportion of involved lung parenchyma, i.e., diffuse pulmonary infiltrates, is associated with severe disease [38][39][40] .Furthermore, we reveal that biomarkers not playing a role in acute phase response, such as ALT, potassium, and creatinine, are not correlated with radiomics features.
Sparse CCA has been championed in various biomedical domains, from understanding eating disorders via imaging data to categorizing clinical subtypes in dementia 41,42 .Our application to COVID-19 aligns with this trajectory, delineating clinical phenotypes like kidney disease, liver dysfunction, and age-related vulnerabilities, supported by existing literature and our clinical observations [43][44][45][46] .However, challenges persist.Multi-modal data fusion demands rigorous computational techniques, nuanced strategies for feature extraction, and the transformation of raw data into a structured format.Our experiments show that when these elements align-as seen with the optimal results using Word2Vec encoding-the outcomes are promising.This performance edge over Viral-Binary encoding likely arises because Word2Vec benefits from an expansive external database, capturing nuances not specific to our dataset.
There are various studies focusing on predicting a supervised task by combining imaging and clinical data in COVID-19 patients 36,[47][48][49][50][51] .It is foreseeable that viral genome sequencing will be directly integrated into clinical patient management in the coming years 52 .Yet, decisions surrounding data fusion techniques and staging remain paramount.We employed cooperative learning, enhancing alignment across modalities, to determine optimal fusion, guided by the agreement penalty tuning to distinguish which patients should be admitted to the medical ward or ICU when the relevant patient data was collected.At the end of the analysis, a review of the features selected in both our unsupervised and supervised models revealed a consistency, highlighting a harmony between an unsupervised use of sparse CCA analysis and supervised predictive cooperative learning.
A few caveats need to be noted.First, our relatively small sample size might be suffering from overfitting of data.To overcome this issue, we used a stratified nested CV framework for estimating the generalization performances of the trained models.The sparsity penalty in both sparse CCA and cooperative learning also helped mitigate the risk of overfitting.Next, from a clinical point of view, this study is dated in the early days of the pandemic, and a small number of patients were vaccinated for COVID-19.Because it is not possible to quantify the vaccination effect properly, we had to ignore this effect.
To conclude, our findings reinforce the power and potential of multimodal data fusion in biomedical research.Sparse CCA analysis and cooperative learning are pivotal tools in managing and interpreting highdimensional data, as in the example of COVID-19.The Word2Vec model, as employed for viral genome encoding, is particularly promising, hinting at future directions for research in multi-modal biomedical data fusion.

Study design and data collection
This prospective cohort study was conducted in a tertiary care academic center in Ankara, Turkey, between December 22, 2020, and May 5, 2021.• Elevated serum troponin with arrhythmia, • Lactate >2 mmol/l.
Relevant clinical information was gathered through face-to-face interviews with patients and attending physicians and by reviewing clinical records.Age, sex, comorbidities such as diabetes, heart failure, coronary artery disease, hypertension, malignancy, polymerase chain reaction (PCR) test results, vaccination history, medications and therapies, outcomes, laboratory results, GISAID EPI_SET identifiers, and imaging data were recorded.Age-adjusted Charlson comorbidity index (CCI) was calculated for each patient.Min-max normalization was performed for age and CCI.Since laboratory results can change within days or even hours in COVID-19 patients, only results at the time of imaging were included.Among recorded 22 laboratory features out of 127 patients, erythrocyte sedimentation rate (ESR) was missing in 5 (4%), and troponin-I and myoglobin were absent in 11 (8.7%) patients.The mean substitution was performed to handle these missing values.All the information related to patients has been anonymized.

Sampling and next-generation viral genome sequencing protocol
A nasopharyngeal swab or nasal specimen combined with an oropharyngeal swab was obtained from conscious patients, and a tracheal aspirate from intubated patients, in case they comply with inclusion criteria.Medical Wire M40-A Compliant Sigma-Virocult™ Viral Collection and Transport System combining open-bud Sigma-Swab™ with Virocult™ medium was used.EZ1 Virus mini kit V2.0 (Catalog number: 955134, Qiagen, Germany) was used for total nucleic acid extraction.Samples were directly introduced to the sequencing platform after nucleic acid extraction.Library preparation was performed using Respiratory Virus Oligos Panel V2 (Illumina Inc., #20044311) and Illumina RNA Prep with Enrichment, (L) Tagmentation (Illumina Inc., #20040537) kits according to official protocol.Before sequencing, the libraries were quantified and checked on Qubit 4 Fluorometer (ThermoFisher Inc.) and 2100 Bioanalyzer systems (Agilent Inc.).The sequencing was performed on Illumina NextSeq 550 platforms with 1,000,000 reads (2 × 150 bp) per sample on average.Quality control of the raw data was analyzed using the FASTQC tool.The quality passed samples were uploaded to BaseSpace (Illumina Inc.) for bioinformatics analysis.The data were analyzed using the DRAGEN COVID Lineage app (v.3.5.4) on the BaseSpace platform.The low-quality and low-variant fraction variants were filtered out (coverage >10).The filtered variants were submitted to the GISAID database 20 .
Our approach contained the following steps (Fig. 7): • Using binary encoding and leveraging the Word2vec natural language processing model (NLP) for viral encoding 59 .• Extraction of radiomics features from CT images.
• Performing canonical correlation analysis to understand relationships between data modalities and identify clinical phenotypes 23,24 .• Using cooperative learning to build prediction models for ICU admission 25 .

Viral feature preprocessing
Before utilizing any machine learning model on a genome or amino acid sequence, it is necessary to convert the sequence into a numerical format of fixed length to construct an embedding space.For this purpose, various techniques available in the literature are broadly classified as alignment-free and alignment-based methods 60 .Alignment-free methods are mainly divided into word-based and information theory-based methods.While wordbased methods rely on discovering the frequency of words (k-mers) within sequences and use similarity or dissimilarity measures derived from these patterns, information theory-based methods capture the information shared among sequences using entropy or complexity metrics 61 .On the other hand, even though alignment-based methods have some disadvantages, such as becoming computationally expensive or assuming that homologous sequences share conserved sequences, they still constitute a well-established approach in phylogenetic studies.During the COVID-19 pandemic, Pango, Nextclade, and WHO classification systems have been widely used, and all these systems essentially rely on this methodology 62,63 .Also, they were used in machine-learning prediction studies for creating viral feature embeddings 64,65 .
We tried and compared two different techniques for viral encoding: • Viral-Binary encoding: as a typical example of alignment-based methods, according to whether mutations were present or were not, each mutation was encoded as "0" or "1".In total, 439 unique amino acid mutations were identified in 105 isolated study strains.This created a binary column for each mutation and returned a sparse matrix.• Viral-Word2Vec encoding: since well-established viral genome databases exist on a global scale, we aimed to combine alignmentfree and alignment-based models leveraging the Word2Vec NLP model to reduce the size of the embedding space and extract the semantic relationship between each of the mutations and the strains themselves 59,66 .We treated amino acid mutations as words and strains as sentences and used the Skip-Gram model architecture that predicts surrounding mutations in a context window given the current mutation.To construct the corpus, viral strains whose outcomes were known and collected between December 30, 2019, and March 2, 2023, on the GISAID database were used.After collecting the data, we defined the outlier strains as falling below Q1 − 1.5 IQR or above Q3 + 1.5 IQR in terms of the number of mutations.As a result, 653,134 viral genomes were used for constructing the corpus.The maximum number of mutations per strain was found to be 115, and accordingly, this number was chosen as the context window to train the Word2Vec model with vector dimension 300.The vocabulary size (the number of unique amino acid mutations) was found to be 52,311.Strain embeddings were calculated by getting the strains' mean vector of amino acid mutations.Multi-dimensional scaling (MDS) was used to reduce dimensionality and visualize 2D plots 67 .Cosine similarities between the strains were computed to construct a dissimilarity matrix, and this matrix was used as the precomputed dissimilarity metric in MDS.

Imaging data collection and analysis
For each patient with longitudinal CT images, the images were selected based on the following criterion: Since the study outcome for the supervised task is discharge from the medical ward (non-ICU group) or ICU admission (ICU group), we selected images closest to the outcome (Supplementary Fig. 8).Image data were obtained from SIEMENS scanners.CT section thickness was as follows: less than 1 mm (7 patients, 5.5%), greater than or equal to 1, less than 2 mm (119 patients, 93.7%), and greater than or equal to 2 mm, less than 3 mm (1 patient, 0.01%).All images were visualized in 3D Slicer (v.5.3.0), and the right and left lungs were segmented with U-net-based pretrained lungmask R231 68,69 .Eighteen first-order, 14 shape, 22 Gray-level cooccurrence matrix (GLCM), 16 Gray-level size zone matrix (GLSZM), and 14 Gray-level dependence matrix (GLDM) features were extracted from original and wavelet-filtered images from each lung using pyradiomics (v.3.0.1) 70 .All images were normalized with a scale of 500."sitkBSpline" was used as the interpolator for resampling pixel space to (1.0,1.0,1.0)mm.A fixed bin number of 64 was used for all analyses.

Data fusion methodologies
Assume we have two data matrices, X and Z, of dimensions n × p and n × q on the same set of n observations are given as Eq. ( 1): Canonical correlation analysis (CCA) seeks linear combinations (canonical variables) of the variables in X and Z that are maximally correlated.That is, u 1 = (u 11 , u 21 , …, u p1 ) T and v 1 = (v 11 , v 21 , …, v q1 ) T maximize corr(Xu 1 , Zv 1 ).Here, we refer to u 1 and v 1 as the canonical vectors and Xu 1 and Zv 1 as the canonical variables.u and v have the dimensions of p × K and q × K for K canonical vectors, which are not correlated with each other.
Sparse CCA aims to find sparse canonical vectors u and v such that u T X T Zv is optimized.The analysis was conducted using the PMA (v.1.2.1) package 24 , where we used the CCA.permute function for selecting the sparsity parameters.Multi-CCA was used as an extension where we assessed the relationship between more than two data modalities.The p values were calculated through permutation with the MultiCCA.permutefunction.
Cooperative learning was used for the supervised learning task to build prediction models for ICU admission with multimodal data, which uses an agreement penalty to encourage alignment between predictions from different data modalities.By varying the weight of the agreement penalty, it provides a continuum of solutions that include the early and late fusion approaches.Using cross-validation (CV), we chose the degree of agreement, i.e., the optimal weight on the agreement penalty.Cooperative learning also combines the lasso penalty with the agreement penalty, yielding feature sparsity.Analyses were performed with the multiview (v.0.8) package 25 .
We used a repeated stratified nested CV framework for hyperparameter tuning, model selection, and assessment (Supplementary Fig. 9).Tenfold CV was performed, with the loss function as "deviance" for tuning the elastic-net mixing parameter and the weight of the agreement penalty in the inner loop.The outer loop assessed the performance of models trained in the inner loop.The final performance scores were averaged after a 5-fold CV.We conducted each experiment 30 times.

Statistical analysis
Descriptive statistics were used to calculate frequency and percent distributions.After testing assumptions of normality, the mean and standard deviation were used for continuous variables with normal distribution and the median and interquartile range for continuous variables without normal distribution.The statistical significance of the differences between groups was tested using Chi-square and Fischer's exact Chi-square tests for categorical variables, an independent two-sided t-test for continuous variables with normal distribution, and a Mann-Whitney U test for continuous variables without normal distribution.The equality of variances of the results of cooperative learning was assessed with Bartlett's test.If the hypothesis of equal variances was rejected, Welch's ANOVA was used to test the significance between three or more groups.The Games-Howell post hoc test was used as the nonparametric approach to compare pairwise results of cooperative learning.Type I error was set at 0.05 for all analyses.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Fig. 1 |
Fig. 1 | Phylogenetic tree, nucleotide substitution matrix, and Word2Vec encoding plot of isolated SARS-CoV-2 strains.a The phylogenetic tree of isolated SARS-CoV-2 strains and nucleotide substitutions in matrix form, in which the presence of substitutions is shown in dark red.b The Word2Vec encoding plot of the same strains.The nucleotide substitution matrix and Word2Vec encoding plot represent that Alpha strains appear more similar compared to non-Alpha strains.

Fig. 3 |
Fig. 3 | Sparse CCA analysis of radiomics features and laboratory results.a Correlated radiomics features.Original and wavelet features in the LLL frequency channel have the highest absolute values of coefficients.b Coefficients of the original radiomics features.c Correlated laboratory results.Coefficients of laboratory results align with serum biomarkers related to severe disease and acute phase response.d The correlation between the first set of canonical variables shows that the first pair can capture the ICU outcome.We select two patients (Patient A and Patient B) with the lowest and two patients (Patient C and Patient D) with the highest canonical variables for radiomics features.e Patient A and Patient B's CT images in axial and coronal planes have no pulmonary infiltration, whereas there are apparent findings on Patient C and Patient D's CT images for COVID-19 pneumonia.f We select and visualize 30 variables with the highest and 30 variables with the lowest coefficients among the radiomics features.g The image intensity histograms of the patients show that Patient A and Patient B have left-skewed histograms peaking around −1000 to −800 HU, consistent with air and lung parenchyma densities; however, histograms of Patient C and Patient D are flatter and more right-skewed, consistent with negative coefficients for skewness and kurtosis features and revealing a wider distribution of HU values.

Fig. 4 |
Fig. 4 | Sparce CCA analysis of laboratory results and clinical data.The first four canonical variables are provided.Different canonical variables provide different clinical phenotypes: the first canonical variables represent a patient phenotype who is elderly, multi-morbid, and has moderate to severe renal disease with high creatinine and myoglobin levels, whereas the third canonical variables represent a different patient phenotype with moderate to severe liver disease with high bilirubin and INR levels.

Fig. 5 |
Fig. 5 | Sparse multi-CCA analysis of all data modalities.a The correlation pairs plot of the first canonical vectors of four data modalities, including Viral-Binary encoding.b Using Viral-Word2Vec encoding instead of Viral-Binary encoding provides a more homogenous distribution and better separation among canonical variables.

Fig. 6 |
Fig. 6 | Unimodal and multimodal prediction models for the supervised task.The best accuracy and AUC values are achieved with the quadruple model using Word2Vec Encoding (CLRW).C clinical data, L laboratory results, R radiomics, B viral-binary encoding, W Viral-Word2Vec encoding, AUC area under the curve, ns non-significant.

Fig. 7 |
Fig. 7 | Flowchart of the methods used in the study.Unsupervised sparse canonical correlation analysis and supervised cooperative learning are used for multi-modal data fusion.Two different viral encoding techniques are performed and compared in these analyses.

Table 1 |
Clinical characteristics of the patients ICU intensive care unit, LOS length of stay, IQR interquartile range.

Table 2 |
Sparse CCA analysis for examining pairwise associations between all data modalities Sparsity parameters and correlations were calculated with the CCA.permute function.We chose the L1 bound for X and Z at the highest value of the Z-stat for each pairwise data modality.n number, E encoding.