Using transfer learning from prior reference knowledge to improve the clustering of single-cell RNA-Seq data

In many research areas scientists are interested in clustering objects within small datasets while making use of prior knowledge from large reference datasets. We propose a method to apply the machine learning concept of transfer learning to unsupervised clustering problems and show its effectiveness in the field of single-cell RNA sequencing (scRNA-Seq). The goal of scRNA-Seq experiments is often the definition and cataloguing of cell types from the transcriptional output of individual cells. To improve the clustering of small disease- or tissue-specific datasets, for which the identification of rare cell types is often problematic, we propose a transfer learning method to utilize large and well-annotated reference datasets, such as those produced by the Human Cell Atlas. Our approach modifies the dataset of interest while incorporating key information from the larger reference dataset via Non-negative Matrix Factorization (NMF). The modified dataset is subsequently provided to a clustering algorithm. We empirically evaluate the benefits of our approach on simulated scRNA-Seq data as well as on publicly available datasets. Finally, we present results for the analysis of a recently published small dataset and find improved clustering when transferring knowledge from a large reference dataset. Implementations of the method are available at https://github.com/nicococo/scRNA.


Supplementary Methods
Given a well-known source dataset with clustering labels, the proposed method improves the clustering of an unlabeled target dataset by transferring knowledge from the source to the target data via Non-negative Matrix Factorization (NMF) 1,2. The modified target dataset can then be provided to any clustering algorithm. In this work we use SC3 3 as an example.
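The transfer step can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the method description: data matrices are stored as genes x cells, the factorization uses plain multiplicative updates without regularization, the source cluster labels are ignored, and the modified target data is a convex combination of the original and the reconstructed target data weighted by the mixture parameter θ. The function names `nmf` and `transfer` are hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=500, tol=1e-3, seed=0, W=None):
    """Approximate X (genes x cells) by W @ H using multiplicative updates.

    If a dictionary W is passed in, it is kept fixed and only H is fitted.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-9
    fixed_W = W is not None
    if not fixed_W:
        W = rng.random((X.shape[0], k)) + eps
    H = rng.random((W.shape[1], X.shape[1])) + eps
    prev = np.inf
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        if not fixed_W:
            W *= (X @ H.T) / (W @ H @ H.T + eps)
        err = np.linalg.norm(X - W @ H)
        # Stop once the reconstruction error changes by less than tol (relative).
        if abs(prev - err) <= tol * max(err, eps):
            break
        prev = err
    return W, H

def transfer(X_src, X_trg, k, theta):
    """Mix the target data with its reconstruction from the source dictionary.

    theta = 0 returns the original target data (no transfer);
    theta = 1 returns the reconstruction only (full mixture).
    """
    W, _ = nmf(X_src, k)               # dictionary learned on the source data
    _, H_trg = nmf(X_trg, k, W=W)      # target coefficients with W held fixed
    return (1.0 - theta) * X_trg + theta * (W @ H_trg)

# Toy usage on random non-negative data (20 genes, 30 source and 8 target cells).
rng = np.random.default_rng(1)
X_src = rng.random((20, 30))
X_trg = rng.random((20, 8))
X_mixed = transfer(X_src, X_trg, k=3, theta=0.5)
```

The returned matrix has the same shape as the target data, so it can be passed unchanged to a clustering algorithm such as SC3.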
The following sections describe the exemplary clustering algorithm, SC3, in more detail and present a visualization of the baseline methods that the transfer learning approach is compared to.

SC3: Consensus clustering of single-cell RNA-seq data
[...] 4 to compute a consensus matrix. Hierarchical clustering is finally used to cluster the resulting matrix into k clusters.
Supplementary Figure S1: The SC3 framework for consensus clustering (Kiselev 2017
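The consensus step of SC3 can be sketched as follows. This is an illustration under assumptions, not the SC3 implementation: each individual clustering is given as a vector of integer labels, and the consensus matrix entry for a pair of cells is taken to be the fraction of clusterings that place both cells in the same cluster. The function name `consensus_matrix` is hypothetical.

```python
import numpy as np

def consensus_matrix(labelings):
    """Fraction of clusterings in which each pair of cells shares a cluster.

    labelings: array-like of shape (n_clusterings, n_cells) with integer labels.
    Returns a symmetric (n_cells, n_cells) matrix with ones on the diagonal.
    """
    labelings = np.asarray(labelings)
    # same[r, i, j] is True if cells i and j share a cluster in clustering r.
    same = labelings[:, :, None] == labelings[:, None, :]
    return same.mean(axis=0)

# Two clusterings of three cells that disagree on the middle cell.
C = consensus_matrix([[0, 0, 1],
                      [0, 1, 1]])
```

The resulting matrix would then be handed to a hierarchical clustering routine and cut into k clusters; that step is omitted here.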

Baseline methods
For assessing the quality of our unsupervised transfer learning solution, we compare the clustering accuracy of our method on a target dataset with that of two competitor methods. As baseline methods we apply the original SC3 clustering method to the target dataset alone (TargetCluster) and to the concatenated dataset of source and target data (ConcatenateCluster).
A TargetCluster. Clustering is applied to the target dataset alone.
B ConcatenateCluster. Source dataset and target dataset are combined into one large dataset via simple concatenation before clustering the new dataset as a whole. Performance (i.e. accuracy) is calculated on the target dataset only, since it is the main focus of interest for clustering.
C TransferCluster. The proposed method of knowledge transfer is applied to the target dataset, learning from a large labeled source dataset. The resulting modified target dataset is then provided to the clustering procedure.
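The ConcatenateCluster baseline, with evaluation restricted to the target cells, can be sketched as below. The sketch assumes genes x cells matrices with matching gene order. The argument `cluster_fn` is a hypothetical stand-in for any clustering routine that returns one label per cell, and the thresholding rule in the usage example is only a toy placeholder for SC3.

```python
import numpy as np

def concatenate_cluster(X_src, X_trg, cluster_fn):
    """Cluster source and target cells jointly, return labels of target cells only."""
    X_all = np.concatenate([X_src, X_trg], axis=1)   # stack cells side by side
    labels_all = np.asarray(cluster_fn(X_all))
    return labels_all[X_src.shape[1]:]               # drop the source cells

# Toy usage: a placeholder "clustering" that thresholds total expression per cell.
X_src = np.array([[0.0, 0.0, 5.0],
                  [1.0, 0.0, 6.0]])
X_trg = np.array([[4.0, 0.0],
                  [7.0, 1.0]])
toy_cluster = lambda X: (X.sum(axis=0) > 3.0).astype(int)
target_labels = concatenate_cluster(X_src, X_trg, toy_cluster)
```

Only `target_labels` would enter the accuracy computation, which keeps the comparison with TargetCluster and TransferCluster on the same set of cells.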

Pre-processing
The pre-processing steps were not applied to the generated datasets, because the generation process did not produce any unfavourable genes or cells.

Parameter selection
For each overlap setting (described in the Method Section of the main text), 100 datasets of 1000 source cells and 800 target cells were generated. The datasets consisted of simulated count data for 10,000 genes. All three competitor methods were applied to downsampled target datasets, where for each repetition 10, 25, 50, 100, 200, 400, 600 and 800 cells were randomly selected from the complete target dataset.
The generated datasets were used to determine performance changes induced by varying the free parameters of the method and to identify optimal settings, which were assumed to be good choices for the application of the proposed method to real datasets. Here, we present the chosen values of the free parameters of the TransferCluster method which were mostly used for the investigation of
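The downsampling of the target dataset can be sketched as follows. The sketch assumes a genes x cells matrix and sampling without replacement, which the description above does not state explicitly; the function name `downsample_targets` is hypothetical.

```python
import numpy as np

# Target subset sizes used in the simulation study.
SIZES = [10, 25, 50, 100, 200, 400, 600, 800]

def downsample_targets(X_trg, sizes=SIZES, seed=0):
    """Return one random subset of target cells for every requested size."""
    rng = np.random.default_rng(seed)
    n_cells = X_trg.shape[1]
    subsets = {}
    for n in sizes:
        # Draw n distinct cell indices (assumption: no replacement).
        idx = rng.choice(n_cells, size=n, replace=False)
        subsets[n] = X_trg[:, idx]
    return subsets

# Toy usage: 5 genes, 800 target cells, each cell marked by its index.
X_trg = np.tile(np.arange(800), (5, 1))
subsets = downsample_targets(X_trg)
```

Repeating the call with a different seed for each of the 100 generated datasets would give one set of subsets per repetition.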

Results and mixture parameter selection via KTA scores
The mixture parameter θ dictates how much the newly constructed target dataset should be [...] The simulation study on generated scRNA-Seq data was used to investigate the performance of this parameter selection procedure.
Supplementary Figure S4 gives insight into the procedure within TransferCluster that automatically selects the mixture parameter θ based on KTA scores. The first row of performance plots shows the original results on the generated datasets which can also be found in the main text of the article.
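A selection rule based on kernel target alignment (KTA) can be sketched as below. The alignment formula is the standard normalized Frobenius inner product of two kernel matrices. Everything else is an assumption made for illustration: the candidate datasets are genes x cells matrices indexed by θ, a linear kernel between cells is used, and the reference kernel is built from a given label vector. Which kernels the TransferCluster procedure actually compares is defined in the main text, not here. The names `kta`, `label_kernel` and `select_theta` are hypothetical.

```python
import numpy as np

def kta(K1, K2):
    """Kernel target alignment: <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    num = np.sum(K1 * K2)
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

def label_kernel(labels):
    """Kernel with entry 1 where two cells share a label and 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def select_theta(mixed_by_theta, reference_kernel):
    """Pick the theta whose linear cell-cell kernel aligns best with the reference."""
    scores = {theta: kta(X.T @ X, reference_kernel)
              for theta, X in mixed_by_theta.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy usage: four cells, two candidate datasets (values of theta are arbitrary).
K_ref = label_kernel([0, 0, 1, 1])
candidates = {
    0.3: np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]]),   # matches the reference grouping
    0.9: np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]]),   # groups the cells differently
}
best_theta, scores = select_theta(candidates, K_ref)
```

In the toy example the first candidate reproduces the reference grouping exactly and therefore attains the maximal alignment of 1.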

Ground truth cluster memberships
The transfer learning approach and its baselines were investigated under two different conditions. Firstly, we assumed that no ground truth labels were available and generated labels for 18 cell clusters via NMF clustering 1,2 on the whole dataset (Supplementary Table S1). As it is based on the totality of the data, we interpret this clustering as a ground truth clustering and apply our method and the baseline algorithms to a subset of the dataset, to see how each method performs relative to this definition of ground truth when not all of the data is available. For the source dataset, those labels are put into the TransferCluster procedure. For the validation of all methods, the target labels are used for measuring accuracy. Secondly, we use the data-driven clustering labels provided in the original paper and take those as the ground truth labels. Specifically, we use a cut-off point in the provided clustering hierarchy that results in 18 clusters (Supplementary Table S2). Given those alternative ground truth labels, we once again run TargetCluster, ConcatenateCluster and TransferCluster on the Tasic data.

Pre-processing
Before pre-processing, the original Hockley 6 dataset contained 314 cells and 45513 genes.
The parameters of the pre-processing filters described in the Method Section of the main text were set to xgenes = 2000, xexpression = 1, xcells = 94 after inspection of the expression histogram in

Parameter selection
The free parameters in the NMF step of the method were chosen according to the best results in the controlled environment of the generated datasets, i.e. α = 10.0 and λ = 0.75. The maximum number of iterations was set to 4000, with convergence declared at a relative error of 0.001. The number of clusters passed to the different clustering methods was k = 7, which was the number of cell types identified in the original Hockley publication.
The mixture parameter θ is again selected automatically via the KTA score selection process (see the Methods Section in the main paper). In Supplementary Figure S8 we present the KTA scores for a range of θ between 0 (meaning no mixture, i.e. no transfer learning) and 1 (meaning full mixture) and note that high θ values are to be avoided and taking lower θ is to be preferred. The maximal KTA score is obtained for θ = 0.7, which is the value that is consequently chosen by the automatic procedure. These findings indicate that the proposed transfer learning method was able to identify relatedness but also differences between the two datasets by automatically choosing a mixture parameter that lies in the middle of the range of possible values of θ. This is in accordance with the fact that the source and target datasets are completely independent, but biologically related, datasets collected at different times and places.

Source cluster memberships
As with the Tasic dataset, we first analyzed the Hockley data under the assumption that no reliable source labels for the Usoskin dataset were available, and generated them via NMF clustering. We assumed a complete overlap between the cell types in source and target data and chose the number of clusters to be k = 7 for the source label generation. See Supplementary Table S3 for the corresponding cell counts of each cluster.