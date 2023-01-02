The cohort

The cohort and available data included in the study are described in detail in Koivula et al.62,63 and Wesolowska–Andersen and Brorsson et al. (ref. 7). In brief, we used the newly diagnosed sub-cohort of the IMI-DIRECT study, consisting of 789 participants. Fifty-eight percent of participants were male, and participants had the following characteristics at baseline: age 62 (8.1) years; body mass index 30.5 (5.0) kg m−2; fasting glucose 7.2 (1.4) mmol l−1; 2 h glucose 8.6 (2.8) mmol l−1. Participants were diagnosed within 2 years before recruitment and had glycated hemoglobin (HbA1c) < 60.0 mmol mol−1 (<7.6%) within the previous 3 months. All samples represent distinct individuals. Furthermore, while Wesolowska–Andersen and Brorsson et al.7 used data from baseline and follow-up at 18 and 36 months, we used only baseline data for modeling. In addition to the baseline data from Wesolowska–Andersen and Brorsson et al., we carried out extensive curation and harmonization of the medication records entered in the electronic case forms by the research nurses at the different recruitment centers and thus used standardized, ATC-annotated medication data for the individuals (see further detail below). Approval for the study protocol was obtained from each of the regional research ethics review boards separately (Lund, Sweden: 20130312105459927; Copenhagen, Denmark: H-1-2012-166 and H-1-2012-100; Amsterdam, Netherlands: NL40099.029.12; Newcastle, Dundee, and Exeter, UK: 12/NE/0132) and all participants provided written informed consent at enrollment. The research conformed to the ethical principles for medical research involving human participants outlined in the Declaration of Helsinki. Further details about the data generation can be found in Wesolowska–Andersen and Brorsson et al.7.

Pre-processing of data

From the clinical, environmental, and questionnaire data, only variables that varied across the dataset and were present in at least 10% of the individuals were included. The genomic data were included as the genotypes of risk alleles identified in Mahajan et al.64. In total, 393 risk alleles were identified in our cohort out of the 403 associations reported in that paper. The genotypes were encoded as homozygous for the risk allele, heterozygous, not carrying the allele, or missing if the locus was not identified for the individual. Diet data were included as 47 features on self-reported total intake of macronutrients and vitamins across a 24-h period. The wearable data, measured with an accelerometer, included 25 measurements that summarize movement and heart rate during the day. Transcriptomics data (RNA sequencing) from fasting whole blood samples were processed with RailRNA (v0.2.4b)65 to obtain scaled counts for all samples, and only the most variable genes were included. The variable genes were selected by calculating the standard deviation across all individuals for each gene and selecting genes with an above-average standard deviation. Both targeted and untargeted metabolomics data in fasting plasma were included for all measurements passing quality control. In the proteomics data, all measurements within the measurable range based on the OLINK antibody panel were included and residualized for plate layout. The metagenomics data were available for only approximately one-third (256) of the individuals and were included as normalized read counts of identified Metagenomic Species66. Categorical data, including questionnaire responses, drug data, and genomics, were one-hot encoded.
The continuous data were residualized for collection center because the data were collected in six different European countries and thus handled by different nurses and laboratory technicians, and because the time of day at which samples were taken differed between centers; both factors could have a large effect on the measurements. Additionally, the data were residualized for age and sex, as these could be biological, non-disease-related confounders in the data. Lastly, each continuous dataset was z-score normalized per feature to ensure that each feature was distributed around zero.
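The residualization and scaling steps can be illustrated with a minimal sketch. It assumes that residualizing for a categorical covariate such as collection center amounts to subtracting the per-center mean; adjustment for age and sex would need a regression step that is not shown. The function names, the example values, and the use of `None` for missing values are hypothetical and are not taken from the study code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def residualize_by_group(values, groups):
    """Subtract each group's mean from its members; missing values stay None."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        if v is not None:
            by_group[g].append(v)
    group_mean = {g: mean(vs) for g, vs in by_group.items()}
    return [None if v is None else v - group_mean[g]
            for v, g in zip(values, groups)]


def z_score(values):
    """Scale non-missing values to zero mean and unit standard deviation."""
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), pstdev(observed)
    if sd == 0:
        raise ValueError("feature has no variation")
    return [None if v is None else (v - mu) / sd for v in values]


# Hypothetical feature measured at two centers, with one missing value.
glucose = [7.0, 7.4, None, 8.1, 8.5, 6.2]
center = ["Lund", "Lund", "Lund", "Exeter", "Exeter", "Exeter"]
scaled = z_score(residualize_by_group(glucose, center))
```

After these two steps a missing value can be represented by zero, the feature mean, without shifting the distribution.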

Classification of drugs using the ATC system

The ATC system is the WHO classification system for therapeutic drugs. The system has a hierarchical structure, where the topmost level, ‘level 1—Anatomical main group’, specifies the target organ or tissue, and the lowermost level, ‘level 5—chemical substance’, specifies the active chemical compound. The three levels in between specify the therapeutic, pharmacological, and chemical levels, respectively. We, therefore, mapped all drugs to the lowest possible level to prevent information loss. A total of 4,155 entries could be mapped to level 5. For 55 entries, only a higher-level mapping was possible owing to lack of specificity and 43 entries could not be mapped to the ATC system, either because of the compound not existing in the database, for example nutraceutical compounds, or when we were unable to identify which drug was registered for the participant. The ATC system does not only specify compound names, but also administration route and daily dosages for over half of level 5 entries. However, owing to uncertainty of the reliability of the registered dosages, only drug names and administration routes were used for mapping. In instances where the administration route was not available, the drug was mapped by drug name only.
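The hierarchy described above is reflected in the length of an ATC code: level 1 uses one character, level 2 three, level 3 four, level 4 five, and level 5 seven. The following sketch shows how a code could be assigned a level and truncated to a coarser one. The helper names are hypothetical and are not part of the study pipeline.

```python
# Number of characters in an ATC code at each hierarchical level.
ATC_LEVEL_LENGTHS = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_level(code):
    """Return the ATC level (1-5) implied by the length of a code."""
    for level, length in ATC_LEVEL_LENGTHS.items():
        if len(code) == length:
            return level
    raise ValueError(f"not a valid ATC code length: {code!r}")


def atc_prefix(code, level):
    """Truncate a code to a coarser level, e.g. level 2 (therapeutic subgroup)."""
    if level > atc_level(code):
        raise ValueError("cannot refine a code to a deeper level")
    return code[:ATC_LEVEL_LENGTHS[level]]


def same_therapeutic_subgroup(code_a, code_b):
    """True if two codes share the same ATC level 2 prefix."""
    return atc_prefix(code_a, 2) == atc_prefix(code_b, 2)
```

For example, `atc_prefix("C09AA04", 2)` gives `"C09"`, which is the kind of grouping needed when comparing drugs at ATC level 2.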

Drug data collection and clean-up

The study participants were asked to register their current drug usage at screening and baseline. Drug names were registered as free text together with administration route, dosage and frequency, and indication. Metformin was recorded separately from other anti-diabetic and non-anti-diabetic drugs. The collected data varied in quality and used both generic and brand names, which were in many cases specific to the country of the participant. The data were cleaned in four steps: (1) removal of special characters, company names, formulations, and other non-relevant information; (2) automatic mapping to the PubChem database; (3) manual mapping to generic drug names; and (4) mapping to the ATC system. Indications of placebo use, for example participation in clinical drug trials, were noted as such. Only active compounds were included and, consequently, possible brand variation was ignored, including for dietary supplements. Drug combinations were mapped, when possible, to the ATC code specifying said combination. However, when the proposed ATC code was less specific than the registered drugs, the drug combinations were mapped to individual ATC codes; that is, ‘Perindopril’ (C09AA04) and ‘Indapamide’ (C03BA11) were used instead of ‘Perindopril and diuretics’ (C09BA04). Entries were mapped to ATC codes with the administration route when possible and otherwise mapped without the administration route. Dosage information was not used in the mapping process. In the manual mapping process, 99.4% of terms were assigned and a total of 359 drugs and drug combinations were identified. A total of 339 drugs (94.4%) were mapped to 441 ATC codes.

Design of the VAE

The VAE framework was constructed to allow a variable number of fully connected hidden layers in both the encoder and decoder, together with a latent layer that samples from a Gaussian distribution defined by two vectors of size \(N_L\) representing the means, µ, and standard deviations, σ, with N(0, 1) as the reference distribution. Each hidden layer included both batch normalization and dropout67 and used leaky rectified linear units (LeakyReLU)68 as the activation function. All datasets were concatenated into one input layer containing both categorical and continuous variables. To allow for dataset-specific weights, the error calculation was done separately for each dataset. Here we applied cross-entropy loss for categorical data and mean squared error for continuous data as implemented in PyTorch69. The loss was normalized by dataset input size and batch size. Deviation from the Gaussian distribution was penalized by adding the Kullback–Leibler divergence (KLD) to the loss. The final loss was defined as

$$L = \mathbf{W}_{\mathrm{cat}} \times \mathbf{E}_{\mathrm{cat}} + \mathbf{W}_{\mathrm{con}} \times \mathbf{E}_{\mathrm{con}} + \mathbf{W}_{\mathrm{KLD}} \times \mathrm{KLD}$$

Here, \(\mathbf{E}_{\mathrm{cat}}\) and \(\mathbf{E}_{\mathrm{con}}\) are vectors of normalized reconstruction errors for the categorical and continuous datasets, respectively. \(\mathbf{W}_{\mathrm{cat}}\) and \(\mathbf{W}_{\mathrm{con}}\) are vectors of the same length as the errors that introduce dataset-specific weights. We applied an equal weight of 1 for all datasets except for continuous clinical data, for which we used a weight of 2. \(\mathbf{W}_{\mathrm{KLD}}\) is the weight put on the KLD, defined as \(\mathbf{W}_{\mathrm{KLD}} = \beta \times N_L^{-1}\), for which we used a β of 0.0001 for the final model. The KLD was defined as

$$\mathrm{KLD} = \sum - \frac{1}{2}\left( 1 + \ln \left( \sigma^2 \right) - \mu^2 - \sigma^2 \right)$$
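As a rough illustration of how the terms of the loss combine, the sketch below computes the KLD and the weighted total in plain Python. It assumes that σ denotes the standard deviation, so that the variance σ² enters both the logarithm and the final term of the standard closed form, and that the per-dataset reconstruction errors have already been normalized. The function names and numeric values are hypothetical; the study itself used PyTorch.

```python
import math


def kld(mu, sigma):
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    summed over latent units (sigma is a standard deviation)."""
    return sum(-0.5 * (1 + math.log(s ** 2) - m ** 2 - s ** 2)
               for m, s in zip(mu, sigma))


def total_loss(e_cat, w_cat, e_con, w_con, kld_value, beta, n_latent):
    """Weighted per-dataset reconstruction errors plus the weighted KLD."""
    w_kld = beta / n_latent  # W_KLD = beta * N_L^-1
    cat = sum(w * e for w, e in zip(w_cat, e_cat))
    con = sum(w * e for w, e in zip(w_con, e_con))
    return cat + con + w_kld * kld_value


# One categorical dataset and two continuous datasets, the first of which
# carries a weight of 2 as described for the continuous clinical data.
loss = total_loss(e_cat=[0.5], w_cat=[1.0],
                  e_con=[0.2, 0.1], w_con=[2.0, 1.0],
                  kld_value=10.0, beta=0.0001, n_latent=100)
```

With a β of 0.0001 and 100 latent units, the KLD term contributes very little to the total, so the reconstruction errors dominate the gradient in this example.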

To handle missing data efficiently, we encoded missing values of continuous features as the mean value of the feature during training and excluded the missing data points during back-propagation. Because the data were z-score normalized, the mean value is represented as zero. Missing categorical features were encoded as a zero vector, and the ignore index feature in the cross-entropy implementation in PyTorch was used to exclude errors for missing data from the back-propagation. The VAE model was trained with the Adam optimizer70, with an initial mini-batch size of 10 that was increased by a factor of 1.25 after every 50 epochs during training. The number of training epochs was set to 200 on the basis of early stopping on the test set as described below. Additionally, we trained the model using warm-up, including the full KLD only after 10 epochs and increasing its weight gradually at epochs 4, 6, and 8. The latent representation of each patient was obtained by passing their data through the trained VAE and extracting the µ layer. The VAE was implemented using PyTorch69 (v.1.7.0) and run using a GPU running CUDA (v.10.2.89).
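The two training schedules can be sketched as simple functions of the epoch number. Several details are assumptions made only for illustration: epochs are counted from zero, the batch size is truncated to an integer, and the KLD weight rises in equal steps of one quarter at epochs 4, 6, and 8 before reaching its full value at epoch 10. The text does not state the intermediate weights, and the function names are hypothetical.

```python
def batch_size_at(epoch, start=10, factor=1.25, every=50):
    """Batch size after multiplying by `factor` once per `every` completed epochs."""
    return int(start * factor ** (epoch // every))


def kld_warmup_fraction(epoch, steps=(4, 6, 8), full_at=10):
    """Fraction of the full KLD weight in use at a given epoch.

    Equal-sized steps are an assumption; only the step epochs are given.
    """
    if epoch >= full_at:
        return 1.0
    passed = sum(1 for s in steps if epoch >= s)
    return passed / (len(steps) + 1)


# Batch sizes at the start of each 50-epoch block of a 200-epoch run.
sizes = [batch_size_at(e) for e in (0, 50, 100, 150)]
```

Under these assumptions the batch size grows from 10 to 19 over 200 epochs, and the KLD term is switched off entirely for the first four epochs.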

Hyperparameter optimization for multi-omics integration

We initially divided the dataset into training (90%) and test (10%) sets to identify the optimal hyperparameter settings to efficiently capture the data structure without losing the ability to generalize on the test data (Supplementary Figs. 2 and 3). We tested different combinations of sizes of hidden layers, the number of hidden layers, size of latent space, dropout, and weight on the KLD. We then evaluated the model on the basis of both test log-likelihood and reconstruction accuracy. For the number of hidden neurons, the variations used were 200, 500, 800, 1,000, and 1,200, with the number of layers ranging between 1 and 5. The tested latent sizes were between 20 and 400 as well as dropout of 10%, 20%, and 30% and KLD weights of 0.001, 0.0001, and 0.0001. We defined an accurate reconstruction for categorical variables as the class with the highest probability corresponding to the class given by the input. For continuous variables, the accuracy was assessed by comparing the reconstructed array with the input array using cosine similarity for each individual instead of using exact matching. For both categorical and continuous data only non-missing values were used when calculating the accuracy in the reconstruction. We chose the number of training epochs on the basis of when the optimal test likelihood was achieved during testing rounded up to the nearest 100 epochs to ensure sufficient training to learn the complexity of the data. Here we found that more complex models, with higher numbers of hidden neurons and layers, resulted in worse performance on the test set (Supplementary Fig. 2) and that models with more than one hidden layer were unable to provide a decent reconstruction on the training data without overfitting. The only exception was the size of the latent representation, which gave a worse performance with smaller sizes (<50) and equally good performance for larger sizes (from 100 to 400) (Supplementary Fig. 3). 
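The two reconstruction accuracy measures can be sketched as follows. Missing values are represented by `None`, which is an assumption made for this illustration, and the function names are hypothetical. Categorical accuracy counts how often the most probable reconstructed class equals the input class; continuous accuracy is the cosine similarity between input and reconstruction over non-missing features.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def categorical_accuracy(true_classes, probabilities):
    """Fraction of non-missing entries whose most probable class matches the input."""
    pairs = [(t, p) for t, p in zip(true_classes, probabilities) if t is not None]
    hits = sum(1 for t, p in pairs
               if max(range(len(p)), key=p.__getitem__) == t)
    return hits / len(pairs)


def continuous_accuracy(x, x_hat):
    """Cosine similarity between input and reconstruction over non-missing features."""
    pairs = [(a, b) for a, b in zip(x, x_hat) if a is not None]
    return cosine_similarity([a for a, _ in pairs], [b for _, b in pairs])
```

Cosine similarity rewards a reconstruction that preserves the overall profile of an individual even when individual values are not matched exactly.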
For the five best-performing models, stability was measured to choose the final model. The stability of a model was evaluated by repeating training with the same hyperparameters and calculating, for each individual, the change in cosine similarity to all other individuals in the latent space. If the model produced the same result, the average change in cosine similarity would be zero. The model with the average change closest to zero was considered the most stable. The final hyperparameters were set to one hidden layer of 2,000 neurons, a latent size of at least 100, and 10% dropout for regularization.
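One way to read the stability measure is sketched below: compute all pairwise cosine similarities between individuals in the latent space of two training runs and average the difference. Whether the study averaged signed or absolute changes is not stated; the sketch uses signed differences, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_similarity_change(latent_a, latent_b):
    """Average change in pairwise cosine similarity between two training runs.

    latent_a and latent_b hold one latent vector per individual, in the
    same individual order.
    """
    pairs = list(combinations(range(len(latent_a)), 2))
    changes = [cosine_similarity(latent_b[i], latent_b[j])
               - cosine_similarity(latent_a[i], latent_a[j])
               for i, j in pairs]
    return sum(changes) / len(changes)
```

Because only similarities between individuals are compared, two runs whose latent axes are permuted or rotated but whose geometry is identical give a change of zero.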

Evaluating feature importance

Feature importance was extracted from the weights of the network for the models with only one hidden layer and, because the input data were z-score normalized, was calculated as

$$I_i = \mathop {\sum }\limits_{j = 1}^{n_{\mathrm{hidden}}} \left| {w_{ij}} \right|$$

where \(I_i\) is the importance of the ith input feature and \(\left| {w_{ij}} \right|\) is the absolute value of the weight from the ith input to the jth hidden neuron. To assess the actual impact on the latent representation, an adaptation of the SHAP19 analysis was applied. The impact was assessed as the absolute difference in the latent representation when each input was set to missing for all individuals and passed through the trained model.
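The weight-based importance is a row-wise sum of absolute values, as in the following sketch. The weight matrix is indexed here as `weights[i][j]`, from input i to hidden neuron j, to match the formula; the values and function names are hypothetical.

```python
def feature_importance(weights):
    """Return I_i = sum over j of |w_ij| for each input feature i."""
    return [sum(abs(w) for w in row) for row in weights]


def rank_features(importance):
    """Indices of features ordered from most to least important."""
    return sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)


# Three input features connected to two hidden neurons.
weights = [[0.5, -1.5],
           [0.1, 0.2],
           [-2.5, 0.0]]
importance = feature_importance(weights)
order = rank_features(importance)
```

Note that a PyTorch `nn.Linear` layer stores its weight with shape (outputs, inputs), so a matrix taken from such a layer would need to be transposed before applying this indexing.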

Extracting significant drug associations

Drug associations were extracted by perturbation of the input data after training the final model on all individuals. Thus, for each drug we changed the drug status from ‘not receiving’ to ‘receiving’ for all individuals not receiving the drug. Importantly, we only included individuals that did not receive the specific drug or another drug within the same therapeutic subgroup (ATC level 2). Then, for each drug change, we compared the reconstructions with those obtained when we passed the original (unperturbed) data through the network. In other words, we determined the differences that the network infers from the change in drug status, which it learned during training from all individuals receiving the drug. We used two strategies for this: one was based on an ensemble of Student’s t-tests using benchmarked thresholds, and the other was based on Bayesian decision theory. Both approaches were benchmarked against randomized datasets in which all the input data matrices were shuffled on rows and columns. We simulated effects in the shuffled data by randomly sampling a combination of a drug, a multi-omics dataset, and a feature within that omics dataset. For each combination, we then sampled an effect from the standard normal distribution N(0,1) and added this value to the omics feature whenever the selected drug was taken by an individual. We therefore did not expect all effects to be significant in the statistical tests, because we sampled from N(0,1) and some effects will be close to 0. We added a total of 100 effects to the shuffled data and repeated the entire procedure to generate two shuffled datasets, each with its own unique added effects. Additionally, we investigated whether the number of significant associations, the effect size estimates, and the model uncertainty in the reconstruction were biased by individual dataset uncertainties.
This was done by calculating Pearson correlation coefficients (PCCs) between the average estimated effect size across all 20 drugs and the difference between model input and the reconstructions for each of the omics features.

Significant associations using MOVE t-test

To evaluate whether the change in the reconstruction was significant, we first determined the expected average change by passing the original and perturbed data through the model ten times. On the basis of these averages, we used a Student’s t-test for related samples as implemented in Python SciPy (v.1.3.1)71 between the baseline and drug-perturbed data for all non-missing continuous data. All P values were subsequently Bonferroni-corrected independently for each drug, and we applied a significance threshold of adjusted P < 0.05. We repeated the entire analysis, retraining the model 10 times for each of four latent sizes (150, 200, 250, and 300). Associations were only included for analysis if they were significant for at least three of the four latent sizes and in at least five out of ten of the repeats. Therefore, reported P values are the average across the 10 replicates and 4 models, that is, a total of 40 two-sided Bonferroni-corrected t-tests. The change in reconstruction, which we report as effect size, was calculated as the average difference across the 10 replicates and 4 models and is reported with 95% confidence intervals.
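The ensemble criterion can be sketched as below. The wording allows more than one reading; this sketch assumes that a latent size supports an association when at least five of its ten retrained repeats are significant after Bonferroni correction, and that at least three of the four latent sizes must support it. The function names and data layout are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def passes_ensemble(significant, min_repeats=5, min_latent_sizes=3):
    """Apply the repeat and latent-size thresholds to one drug-feature pair.

    significant maps each latent size to a list of booleans, one per
    retrained repeat, stating whether the adjusted P value was below 0.05.
    """
    supported = sum(1 for flags in significant.values()
                    if sum(flags) >= min_repeats)
    return supported >= min_latent_sizes


# Hypothetical outcome for one drug-feature pair across 4 x 10 models.
flags = {
    150: [True] * 5 + [False] * 5,
    200: [True] * 10,
    250: [True] * 6 + [False] * 4,
    300: [False] * 10,
}
keep = passes_ensemble(flags)
```

Requiring agreement across retrained models and latent sizes guards against associations that depend on a single random initialization.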

Significant associations using Bayes decision theory

For the method that was based on Bayesian decision theory we used an approach inspired by single-cell variational inference29 and Lopez et al.31. We trained VAE models with a latent size of 150 neurons and benchmarked the approach using different latent sizes and ensembling 1, 5, 10, 20, 30, 35, 40, or 50 models, which we termed refits. For the refits we averaged the reconstructions and used these to obtain the posteriors for the non-perturbed data and each of the drug perturbations. Thus, for VAE ensemble refit i, individual n, feature f, and drug d we define the variational reconstructions as \(\hat x_{infd}\). By averaging across VAE refits, we obtain estimates of the average posteriors \(\hat x_{nfd}\). Then, for each drug d we compare between two models: \(M_d^f\) where feature f is significantly associated with the drug, and the alternative model \(M_0^f\) where feature f is not significantly associated with drug d. Hence, we evaluate how often \(\left| {\hat x_{nfd} - \hat x_{nf0}} \right| > 0\) and calculate Bayes factors (K) as:

$$K = \log_e\left| \frac{\mathrm{P}\left( M_d^f|\hat x_{fd},\,\hat x_{f0} \right)}{\mathrm{P}\left( M_0^f|\hat x_{fd},\,\hat x_{f0} \right)} \right|$$

We ranked the associated features according to K (ref. 72). We set a false discovery rate (FDR) of α by accepting associations (n) between features and a drug until the cumulative evidence of \(\mathrm{P}(M_0^f)\) across accepted features for the drug reached the threshold. Since \(\mathrm{P}(M_0^f)=1-\mathrm{P}(M_d^f)\), we accepted drug–feature associations while the cumulative evidence E was lower than α:

$$E = \mathop {\sum }\limits_f \frac{{(1 - \mathrm{P}(M_d^f))}}{n} < \alpha$$
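The acceptance rule can be sketched as a single pass over features ranked by K. The sketch assumes that the probability of association can be recovered from the Bayes factor by inverting the log ratio, that is \(\mathrm{P}(M_d^f) = 1/(1 + e^{-K})\), and that features are added in order of decreasing K until the running mean of \(1 - \mathrm{P}(M_d^f)\) would reach α. The function name and the example values are hypothetical.

```python
import math


def accept_by_bayes_fdr(bayes_factors, alpha=0.05):
    """Return the features accepted while the cumulative evidence stays below alpha.

    bayes_factors maps each feature name to its Bayes factor K for one drug.
    """
    ranked = sorted(bayes_factors.items(), key=lambda kv: kv[1], reverse=True)
    accepted, cumulative = [], 0.0
    for feature, k in ranked:
        p_assoc = 1.0 / (1.0 + math.exp(-k))  # invert K = ln(P / (1 - P))
        cumulative += 1.0 - p_assoc
        if cumulative / (len(accepted) + 1) >= alpha:
            break
        accepted.append(feature)
    return accepted


# Hypothetical Bayes factors for four features and one drug.
factors = {"feature_a": 6.0, "feature_b": 4.0, "feature_c": 2.0, "feature_d": 0.0}
hits = accept_by_bayes_fdr(factors, alpha=0.05)
```

In this example three features are accepted at α = 0.05, since their mean probability of no association is about 0.047, while a stricter α of 0.01 retains only the top-ranked feature.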

Benchmarking of t-test, MOVE t-test and MOVE Bayes

To be able to compare the number of significant associations between methods we used the two randomized datasets to estimate FDR from the ground truth, that is the added drug–omics effects (Supplementary Table 3). Here we found that a t-test with Benjamini–Hochberg FDR of 0.01 had ground-truth FDR of 0.00 and 0.06 on the two randomized datasets, corresponding to 52 and 67 true positives as well as 0 and 4 false positives, respectively. For MOVE t-test, we benchmarked the number of refits of the 4 models and found 10 refits to have a ground-truth FDR of 0.02 and 0.06, with 48 and 61 true positives as well as 1 and 3 false positives, respectively. For MOVE Bayes we benchmarked the number of refits for a model with 150 latent neurons and found FDR from the cumulative evidence to be well aligned with FDR of the ground truth. Using Bayes FDR of 0.05 we found 30 refits to have ground-truth FDR of 0.02 and 0.05, respectively. Across the two shuffled datasets 42 and 59 true positives were found by all three methods (Supplementary Fig. 12).

Calculation of drug associations using other methods

We compared our findings to associations identified with standard statistical approaches using Student’s t-test for unrelated samples and an ANOVA between two groups of individuals ‘not receiving’ and ‘receiving’ each drug. Here we used Benjamini–Hochberg correction for FDR73 with an adjusted P < 0.01. Additionally, we tested if a least absolute shrinkage and selection operator (LASSO) model was able to identify features with significant impact on predicting the ‘not receiving’ or ‘receiving’ groups for each drug. However, the LASSO model was unable to converge possibly owing to the high input feature dimensionality. All statistical tests were done with Python SciPy (v.1.3.1)71.
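For reference, the Benjamini–Hochberg adjustment used for the standard tests can be written in a few lines. This is a generic sketch of the procedure rather than the code used in the study, and the P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P values from five features tested for one drug.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.01 for p in adjusted]
```

With the threshold of adjusted P < 0.01, only the first of these five hypothetical tests would be reported.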

Drug effect size and similarities across omics data

Drug effect sizes were determined as the difference between the baseline and drug-perturbed variational reconstructions, that is, as the average difference across the VAE ensemble refits reported with 95% confidence intervals. Drug similarities were calculated as the cosine similarity, as implemented in Python SciPy (v.1.3.1)71, between the average effect sizes on all features identified as significantly associated for at least one of the drugs, both across and within each dataset. The difference was only calculated for non-missing data and for individuals not already on the drug or a drug in the same ATC group. The rank of drug effect sizes was determined for each omics dataset by ranking the effect sizes from 1 to 20. A rank of 20 indicates that the drug had the highest average effect size in this omics dataset compared to the other drugs. Correlations between multi-omics profiles and the number of individuals taking the drug pair were calculated from the fraction of individuals that overlapped between the two drugs.

Molecular-focused analysis of the multi-omics data

To get a better understanding of the molecular profiles identified in the associations for the transcriptomics and proteomics data we tested for enriched Gene Ontology terms as well as molecular pathways. For the transcriptomics data, we assessed the molecular patterns of biological processes and pathways from Reactome74 (v.3.7) using the significantly associated genes for each drug against a background list of all genes included in the data integration. We used WebGestaltR75 (v.0.4.4) for the analysis with default settings (hypergeometric test) and evaluated all results with an FDR < 0.05. The targeted metabolomics data was analyzed for potential metabolite enrichments using MetaboAnalyst76 (v.5) over-representation analysis using a hypergeometric test and FDR of 0.05. We investigated both enrichments in known pathways in the KEGG database as well as enrichment of chemical structures sub-, main- and super-class levels. For all analyses, we used the included panel of targeted metabolites as the reference data.

Association differences within diabetes archetypes

As mentioned, previous work by Wesolowska–Andersen and Brorsson et al. performed archetype analysis of the multi-omics data with only metformin medication data7. Here they based the archetypes on clinical markers and identified four distinct and one ‘mixed’ T2D archetypes with clinical and omics profiles. To investigate if these distinct archetypes differed in their drug associations we used a t-test on the average effect size change for the individuals of each archetype against the remaining individuals. The analysis was only done for the significant drug associations for each drug. All analysis was only done for individuals not taking the drug or a drug within the same ATC therapeutical class similarly to the main analysis.

Drug–drug interactions

We used an in-house drug–drug interaction compendium generated from publicly available sources (Supplementary Table 11) to assess whether drug combinations had been reported previously to be interacting or not77. The compendium contains interactions from 26 different datasets of pharmacovigilance, clinically oriented information, schemas for NLP corpora, and drug–Cytochrome P450 relationships sources. For 12 of the drug–drug pairs in our dataset we could identify drug–drug interactions with reported severity (major, moderate, minor, possible, undetermined, and none) indicating clinical significance.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.