Parameters used in the simulations are listed in Supplementary Table 1. (a) Summary of data and analyses used in this work. (b) Flowchart of the MUSE analysis pipeline. (c) Simulation design (Methods) used to generate sample profiles with two modalities for panels (d-s) below. (d) t-SNE visualizations of latent representations from single- and combined-modality methods for randomly selected simulation experiments in Fig. 1c. Colors: ground-truth subpopulation labels in the simulation. (e) Evaluation of combined-modality methods on simulated data with different ground-truth cluster numbers. n = 1,000 (top) and 3,000 (bottom) samples were considered in the simulations. (Note: for n = 1,000 and cluster number ≥30, each cluster may contain only a small number of samples.) (f) Evaluation of multi-modal methods on simulated data with Gaussian noise of increasing variance (σ). (g) Clustering accuracies for (i) analyses of concatenated modality features using various normalization approaches (Methods), and (ii) MUSE multi-modal analysis on matched or unmatched (sample order randomly permuted in one modality) data. Adjusted Rand indices (ARIs) were calculated from n = 10 repeats. Boxplot: center line, median; box, interquartile range; whiskers, minimum–maximum range; the same annotation applies to the other boxplots in this figure. (h) Example t-SNE visualization of MUSE subpopulations (indicated by shapes) and simple superimposition of single-modality clusters (indicated by colors), with simulation parameters chosen as in (f). (i) Simulation design using real morphological features from STARmap (Methods; dataset details are described in Fig. 3) and performance of multi-modal methods (right; n = 10). (j) Multi-modal analysis of data with homogeneous features in one modality. Transcript profiles (left) were generated from a normal distribution, whereas morphological features (middle) were simulated from known subpopulations as before.
(k) Evaluation of clustering accuracy under different dimensions of the joint latent representation (n = 10). (l) Clustering accuracy of MUSE as the dimension of the morphological features varies from 100 to 1,000 (n = 10). (m) Clustering accuracy of MUSE when fixing the single-modality latent representations (\(h_x\), \(h_y\)) to different dimensions. ARIs were averaged over 10 repeats. Red underlines: parameters selected as default. (n) Effects of clustering methods on accuracy (n = 10). Cluster numbers for the hierarchical and k-means methods were chosen using the elbow method with the distortion score. (o) Run times of the compared methods on simulated data; n = 1,000 cells. Note: for a fair comparison, all methods were run in CPU mode. (p) Run times of MUSE on datasets with larger sample sizes, using different clustering methods for label updating during training. (q) Accuracies and run times when fixing the single-modality labels (denoted \(l_x\) and \(l_y\) in Methods) to the initial labels during training. Each dot represents one independent experiment. (r) Model structure of the multi-modal autoencoder used in MUSE. (s) Performance evaluation of MUSE with different hyperparameter settings (n = 10): 1) weight of the regularization term; 2) weight of the supervision term; 3) learning rate; and 4) iteration interval between cluster updates during training. Red underlines: parameters selected as default in the MUSE package. (t) F-norms of the selective matrices \(w_x\) and \(w_y\) for different true cluster numbers in the data (left) and for different choices of the regularization hyperparameter \(\lambda_{\mathrm{regularization}}\) (right); n = 10. (u) Clustering accuracies (left) and numbers of clusters (right) from PhenoGraph when changing the n_neighbor hyperparameter (n = 10).
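
The clustering accuracies in this figure are reported as ARIs between predicted clusters and ground-truth subpopulation labels. The following is a minimal sketch of that metric, not the code used in this work: the function name `adjusted_rand_index` and the toy label vectors are illustrative assumptions, and the "unmatched" case here shuffles a label vector directly, whereas the control in panel (g) permutes the sample order of one modality before running MUSE.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples.

    Hypothetical helper for illustration; not part of the MUSE package.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no sample pairs to compare

    # Contingency-table cells and the two sets of marginal counts.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both, or in each, labeling.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)


# Toy data (assumed for illustration): 1,000 samples in 10 equal clusters.
rng = random.Random(0)
truth = [i % 10 for i in range(1000)]

# Same partition with different cluster ids: ARI ignores label names.
renamed = [(label + 3) % 10 for label in truth]

# Shuffling the sample order breaks the correspondence between samples.
shuffled = renamed[:]
rng.shuffle(shuffled)

ari_matched = adjusted_rand_index(truth, renamed)
ari_unmatched = adjusted_rand_index(truth, shuffled)
print(f"matched ARI = {ari_matched:.3f}, unmatched ARI = {ari_unmatched:.3f}")
```

Because ARI is corrected for chance, a labeling that carries no information about the ground truth scores close to 0 rather than at some positive baseline, which is what makes a permuted-order control interpretable.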