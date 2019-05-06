Sensitivity analysis of Scanorama alignment parameters and t-SNE visualization parameters for the integration of 26 diverse scRNA-seq datasets. Box plots show distributions of Silhouette Coefficients at different parameter settings. Box plot boxes extend from lower to upper quartiles with an orange line at the median and a green triangle at the mean; whiskers show the range. All distributions are over the same 105,476 cells across 26 heterogeneous scRNA-seq datasets. An asterisk (*) indicates a significantly higher Silhouette Coefficient distribution (two-sided independent t-test, Bonferroni corrected P < 0.05) between Scanorama and no correction, a dagger (†) indicates significance over scran MNN, and a double dagger (‡) indicates significance over Seurat CCA. Importantly, in the analysis for alignment parameters (a-d), Silhouette Coefficients are calculated for the integrated, low-dimensional embeddings. When assessing the sensitivity of t-SNE visualization parameters (e, f), we calculate the Silhouette Coefficients on the 2-dimensional t-SNE embeddings (which are computed from the low dimensional embeddings). All plots also include the Silhouette Coefficient distributions for uncorrected data, Seurat CCA integration, and scran MNN correction on low dimensional embeddings as described in Methods. (a) The Silhouette Coefficient distribution is largely insensitive to the k nearest neighbor parameter around the default value of 20, and k can go as low as 5 without affecting performance. At larger values of k, the matches become more permissive and the Silhouette Coefficients start to drop; at k = 100 the median Silhouette Coefficient (0.091) is below that of the uncorrected case.
(b) There is no significant change in the distribution of Silhouette Coefficients between the approximate and exact nearest neighbors settings (independent, two-sided t-test P = 0.39; n = 105,476 cells), although the integration runtime increases to more than 60 minutes without the approximation algorithm. (c) We recommend keeping α at a low value greater than zero, which can be learned from the data if some of the cell types being integrated are known. Lower values may introduce overcorrection, while higher values approach the uncorrected case. (d) The median Silhouette Coefficient is largely insensitive to different values of the smoothing parameter σ for the Gaussian kernel function. (e) Visualizing the integration of 26 datasets requires a high perplexity (around 500 or greater) to obtain a median Silhouette Coefficient comparable to that for the low dimensional embeddings. We set the perplexity to 1,200 for visualizing the 26 datasets (Fig. 3a). (f) When visualizing the 26 datasets, a higher t-SNE learning rate improves the median Silhouette Coefficient to be comparable to that for the low dimensional embeddings. The Silhouette Coefficient distributions for the t-SNE embeddings are generally wider than those for the low dimensional embeddings since it is harder to obtain large separations between clusters in two dimensions.
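
As a rough illustration of the quantities summarized in these panels, the following standard-library Python sketch computes a per-cell Silhouette Coefficient, the box plot summary statistics named above (quartiles, median, mean, range), and a Bonferroni adjustment of P values. It is not the analysis code used for the figure: the function names `silhouette_samples`, `box_summary`, and `bonferroni` are hypothetical, Euclidean distance is an assumption, and the points are synthetic stand-ins for an embedding with cluster labels.

```python
import math
import random
import statistics


def silhouette_samples(points, labels):
    """Per-point Silhouette Coefficient s = (b - a) / max(a, b).

    Assumes Euclidean distance, at least two distinct labels, and
    points that are not all identical.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores


def box_summary(scores):
    """Quartiles, median, mean, and range, as drawn in a box plot."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return {
        "min": min(scores),
        "q1": q1,
        "median": statistics.median(scores),
        "q3": q3,
        "max": max(scores),
        "mean": statistics.fmean(scores),
    }


def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Synthetic 2-D example: three labelled groups around separate centers.
rng = random.Random(0)
centers = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (0.0, 5.0)}
labels = [lab for lab in centers for _ in range(20)]
points = [
    (centers[lab][0] + rng.gauss(0, 1), centers[lab][1] + rng.gauss(0, 1))
    for lab in labels
]
summary = box_summary(silhouette_samples(points, labels))
print(summary)
```

The pairwise loop is quadratic in the number of cells, so it is only practical for small illustrations; at the scale of 105,476 cells an optimized library routine would be the realistic choice.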