Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models

The application of multiple omics technologies in biomedical cohorts has the potential to reveal patient-level disease characteristics and individualized response to treatment. However, the scale and heterogeneous nature of multi-modal data makes integration and inference a non-trivial task. We developed a deep-learning-based framework, multi-omics variational autoencoders (MOVE), to integrate such data and applied it to a cohort of 789 people with newly diagnosed type 2 diabetes with deep multi-omics phenotyping from the DIRECT consortium. Using in silico perturbations, we identified drug–omics associations across the multi-modal datasets for the 20 most prevalent drugs given to people with type 2 diabetes with substantially higher sensitivity than univariate statistical tests. From these, we among others, identified novel associations between metformin and the gut microbiota as well as opposite molecular responses for the two statins, simvastatin and atorvastatin. We used the associations to quantify drug–drug similarities, assess the degree of polypharmacy and conclude that drug effects are distributed across the multi-omics modalities.


Severity of drug-drug-interactions and clinical data
We investigated if any of the drug combinations had known drug-drug-interactions (DDIs).
Of the 55 drug-drug pairs that were administered in our cohort we identified 52 with DDIs registered across 26 DDI databases where 12 had registered DDIs with known clinical implications ranging from 'no effect' to 'moderate effect'. We then studied if there was an association between the severity of the DDIs and the similarity of the drug-drug pairs in the multi-omics and clinical data. Here, we found that the overall omics data had small or negative associations with Spearman's Rho between -0.19 and 0.095 that were not statistically significant. The highest correlation was between reported DDIs and the clinical response with a Spearman's Rho of 0.32 (Supplementary Figure 21). This could imply that the severity of the DDIs were best captured in the clinical representation whereas the omics profiles were maybe less affected by the level of severity for the DDIs with known clinical implications.

Supplementary Figure 1. Missing data across datasets. Per individual missingness (y-axis)
for each of the multi-omics and clinical datasets (x-axis). The lower and upper hinges correspond to the first and third quartiles (25th and 75th percentiles). The upper and lower whiskers extend from the hinge to the highest and lowest values, respectively, but no further than 1.5 × interquartile range (IQR) from the hinge. IQR is the distance between the first and third quartiles. Data beyond the ends of whiskers are outliers and are plotted individually.
Each individual datapoint is shown, however overplotting can occur, i.e. there are many more points at 100% missingness for Metagenomics than for Untargeted metabolomics. See Supplementary Table 1 for number of features in each dataset.

Supplementary Figure 2. Using a test set for selecting hyperparameters for the VAE.
When selecting the hyperparameters for the VAE in MOVE we divided the dataset into train (90%) and test (10%) and evaluated the test log-likelihood loss (y-axis) as a function of latent neurons (x-axis), number of hidden layers (boxes from left to right), number of hidden neurons in each layer (boxes from upper to lower), dropout and weight on the Kullback-Leibler divergence (dotted and colored lines indicated in legend).

Supplementary Figure 3. Using the training set reconstruction to select
hyperparameters for the VAE. In addition to assessing the test set log-likelihood we assessed the reconstructions of the training set. Here we defined an accurate reconstruction for categorical variables as the class with the highest probability corresponding to the class given by the input. For continuous variables, the accuracy was assessed by comparing the reconstructed array with the input array using cosine similarity for each individual instead of using exact matching. Sample size (n) for each of the comparisons were 10, equal to the number of data sets (omics and clinical). The lower and upper hinges correspond to the first and third quartiles (25th and 75th percentiles). The upper and lower whiskers extend from the hinge to the highest and lowest values, respectively, but no further than 1.5 × interquartile range (IQR) from the hinge. IQR is the distance between the first and third quartiles. Data beyond the ends of whiskers are outliers and are plotted individually.

Supplementary Figure 4. Assessing reconstruction stability for selecting VAE
hyperparameters. Finally, we assessed how stable the reconstructions were between runs with identical hyperparameter settings. The stability of the model was assessed by comparing cosine similarity to the same individuals when run through the first iteration and thus an average change (y-axis) of 0 indicates exactly the same reconstructions as the first iteration.
For each hyperparameter combination, we ran 5 replications. The x-axis legend indicates [no.
hidden neurons]+no. latent neurons, fraction dropout. The final hyperparameters were: one hidden layer of 2,000 neurons, a latent space of at least 100, and dropout of 0.1. Sample size (n) for each of the comparisons were 10, equal to the number of data sets (omics and clinical). The lower and upper hinges correspond to the first and third quartiles (25th and 75th percentiles). The upper and lower whiskers extend from the hinge to the highest and lowest values, respectively, but no further than 1.5 × interquartile range (IQR) from the hinge. IQR is the distance between the first and third quartiles. Data beyond the ends of whiskers are outliers and are plotted individually.

Supplementary Figure 5. Reconstruction accuracy of the selected model. Reconstruction
accuracy of each dataset when using the selected hyperparameter settings on the entire dataset of the 789 individuals. As noted above, we defined an accurate reconstruction for categorical variables as the class with the highest probability corresponding to the class given by the input. For continuous variables, the accuracy was assessed by comparing the reconstructed array with the input array using cosine similarity for each individual instead of using exact matching. The lower and upper hinges correspond to the first and third quartiles (25th and 75th percentiles). The upper and lower whiskers extend from the hinge to the highest and lowest values, respectively, but no further than 1.5 × interquartile range (IQR) from the hinge. IQR is the distance between the first and third quartiles. Data beyond the ends of whiskers are outliers and are plotted individually. Analysis of the two randomized (shuffled) data sets using T-test on residualized data compared to MOVE T-test and MOVE Bayes. We added 100 drug-omics effects sampled from N(0,1) to each of the shuffled data sets and therefore do not expect all to be significant in the statistical tests because some effects will be close to 0. Significance threshold was set at ground truth FDR 0.05.      The score ranges from 1-5 translated to "no effect" (1), "undetermined" (2), "possible effect"

Supplementary Figure 22. Effect sizes of drug and multi-omics associations stratified
per omics dataset. The z-scaled effect size (y-axis) of the significant drug and multi-omics features was stratified by the omics dataset (x-axis). As Figure 3d, but only including significant drug-omics associations. Drugs with no significant omics association for a particular data set have been omitted.    Bayes tests models based on shuffling the input data and adding 100 drug-omics effects sampled from N(0,1). We repeated the shuffling resulting in two data sets, Shuffle 1 and Shuffle 2. Here we found that the MOVE T-test approach had high TP and low FP from 10 refits with ground truth FDR of ~0.05. Note that the MOVE T-test approach is an ensemble of four different models and thus 10 refits represent 40 refits across the 4 models in total. For the MOVE Bayes test approach, we identified 30 refits as the optimal model.  Significance was set to estimated ground truth FDR 0.05 and statistical tests used were twosided.