Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues

Tissue heterogeneity is both a major confounding factor and an underexploited information source. While a handful of reports have demonstrated the potential of supervised computational methods to deconvolute tissue heterogeneity, these approaches require a priori information on the marker genes or composition of known subpopulations. To address the critical problem of the absence of validated marker genes for many (including novel) subpopulations, we describe convex analysis of mixtures (CAM), a fully unsupervised in silico method, for identifying subpopulation marker genes directly from the original mixed gene expressions in scatter space that can improve molecular analyses in many biological contexts. Validated with predesigned mixtures, CAM on the gene expression data from peripheral leukocytes, brain tissue, and yeast cell cycle, revealed novel marker genes that were otherwise undetectable using existing methods. Importantly, CAM requires no a priori information on the number, identity, or composition of the subpopulations present in mixed samples, and does not require the presence of pure subpopulations in sample space. This advantage is significant in that CAM can achieve all of its goals using only a small number of heterogeneous samples, and is more powerful to distinguish between phenotypically similar subpopulations.


Supplementary Figure legends
Supplementary Figure 1: Unreliability of the supervised method on GSE19830 in the presence of uncertainties in measured subpopulation proportions 2 . The constituent proportions were randomly perturbed with ±0.05~0.3 or one of the ten mixtures was replaced with arbitrary constituent proportions. If the proportion is less than 0, it is set to 0. After perturbation, each row sum of the mixing matrix was normalized to 1. (a-c) The estimated liver/brain/lungspecific expression profiles were compared with the ground truth using correlation coefficients estimated over all genes, where the constituent proportions were randomly perturbed with ±0.05~0.3. The experiment was repeated 100 times to reflect true performance with each perturbation rate.
(d-f) The estimated liver/brain/lung-specific expression profiles are compared with ground truth using correlation coefficient over marker genes, where the constituent proportions are randomly perturbed with ±0.05~0.3; and to reflect the true performance, under each perturbation rate, the experiment is repeated 100 times. (g-i) Boxplot of the correlation coefficients between the estimated liver/brain/lung-specific expression profiles and ground truth over all genes, where one of the ten mixtures is replaced with arbitrary constituent proportions and the experiment is repeated 500 times. (j-l) The Boxplots of correlation coefficients between the estimated liver/brain/lung-specific expression profiles and ground truth over marker genes, where one of the ten mixtures is replaced with arbitrary constituent proportions and the experiment is repeated 500 times.

Supplementary Table legends
Supplementary  Supplementary Table 3: CAM validation on GSE11058 -comparison with peer method 1 . Despite being completely unsupervised method, CAM even slightly outperformed the supervised method that requires known reference marker gene signatures to operate.

Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues
Then, for any Conversely, for any , we have that then implies Users can also choose other methods to reduce sample dimension before performing CAM.
One simple approach is to randomly select K+ samples for CAM, although this method cannot take advantage of all information from the complete sample set. Alternatively, one can group all samples into sample clusters, and use the sample cluster centers as pseudo samples to perform CAM. Other strategies, such as Nonnegative matrix factorization (NMF) or random projection can also be used to reduce the sample dimension. that are required to encode/explain both the 'data' and 'model' 12,13 . When the model is given Specifically, when estimating the mixing matrix (i.e., the column vectors ak's) that is parameterized by (K-1)J independent entries, we use some form of vector-average operation (i.e., To test whether the pure subpopulation expression profiles are dependent, we calculate the cross-correlation coefficients between different sources. We can see from Supplementary To assess whether the performance of our unsupervised method CAM is comparable to that achieved by the supervised methods, we compare the results obtained using CAM (without using any form of a priori information, Supplementary Fig. S3) and the results obtained using correct supervised information. From Supplementary Tables S4e and S4f, we can see that CAM not only performs comparably with supervised method and can modestly outperforms these methods. It is important to note that sample dimension reduction is conducted before data deconvolution. Provided that the loss of the observed information is minimal, the separability of patterns is not compromised and is ideally handled by the subsequent signal deconvolution process. The extensive simulation studies that investigated the information loss by eigenvalue decomposition in a Gaussian noise environment support this expectation.

Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues
Marker gene concept. Subpopulation-specific marker (or marker gene) is a widely accepted and adopted concept [1][2][3][4][5][6][7][8] . Subpopulation-specific markers are defined as the markers whose expression values are primarily enriched in a specific subpopulation 1,4,6,9 . Concerning the key assumption on the existence of subpopulation-specific markers for the unsupervised deconvolution approach, the justification is twofold. First, the assumption on the coexistence of distinct subpopulations and subpopulation-specific marker genes is logical, since the definition of a distinct subpopulation would require a set of MGs that are 'unique' to that subpopulation. Hence, a subpopulation may not be considered 'distinct' if it does not have associated unique MGs. Second, since in our newly proved theorems, the coexistence of distinct subpopulations and subpopulation-specific marker genes is proven to be both a sufficient and a necessary condition for a successful deconvolution. Thus, if the assumption on the coexistence of distinct subpopulations and subpopulation-specific marker genes is not true, a successful deconvolution of mixed expressions (validated by the ground truth) is not possible.
If an accurate deconvolution of mixed expressions is successfully achieved, the coexistence of distinct subpopulations and subpopulation-specific marker genes would be guaranteed.
The indices and expression levels of markers are condition dependent. Thus, unsupervised methods for detecting condition-dependent and subpopulation-specific markers are preferred over supervised methods. For example, since expression data are often noisy and the relevant literature is often incomplete, simply using a priori markers may not provide accurate information for data deconvolution 3 . Moreover, from several lines of evidence including our own work in Novartis with sorted cells 10  Furthermore, CD15, CD4, and CD56 cells also express the CD14 epitope. This is not a technical artifact but inherent to using CD14 as a "specific" marker for monocytes. Kuhn et al.  Fig. 4b, which shows the pattern of exclusive enrichment in a particular subpopulations, we can see the evidence for the existence of subpopulation-specific markers and for CAM's ability to identify them blindly and correctly.
In contrast, when using a supervised method to analyze heterogeneous tissue, the user is required to have some a priori knowledge regarding the nature of the tissue and its possible subpopulation constituents. Usually, purified signatures of the candidate subpopulations are found in public repositories 8 , even though they may not be acquired under the same condition as the dataset now being interrogated 3  where cdc28-13 cells were collected at 17 time points taken at 10 min intervals, covering nearly two full cell cycles (synchronous yeast cells growing in G1, S, G2, M, and M/G1 phases). There are 6,208 transcripts in total. In identifying cell cycle relevant genes, a comparative study reported by de Lichtenberg et al. 13 reveals that Spellman's method 14 outperforms other methods in identifying periodically expressed genes. In Spellman's paper, Fourier transformation was used to assign a CDC score to each gene, and genes with CDC score higher than a threshold were identified as cell cycle regulated genes. The authors required the CDC score threshold to be exceeded by 90% of known cell cycle regulated genes (104 genes verified by traditional method), and accordingly, 800 genes were identified as cell cycle regulated by this method. The 800 genes were identified using three datasets: alpha factor, CDC15 and CDC28. Our experiments use CDC28, because the alpha factor or CDC15 contains too many missing values. In the 800 genes, 113 genes belong to M/G1 phase; 300 genes belong to G1 phase; 121 genes belong to G2 phase; 71 genes belong to S phase; 195 genes belong to M phase. Among the 800 genes selected by Spellman et al. 14