Convex Analysis of Mixtures for Separating Non-negative Well-grounded Sources

Blind Source Separation (BSS) is a powerful tool for analyzing composite data patterns in many areas, such as computational biology. We introduce a novel BSS method, Convex Analysis of Mixtures (CAM), for separating non-negative well-grounded sources, which learns the mixing matrix by identifying the lateral edges of the convex data scatter plot. We propose and prove a sufficient and necessary condition for identifying the mixing matrix through edge detection in the noise-free case, which enables CAM to identify the mixing matrix not only in the exact-determined and over-determined scenarios, but also in the under-determined scenario. We show the optimality of the edge detection strategy, even for cases where source well-groundedness is not strictly satisfied. The CAM algorithm integrates plug-in noise filtering using sector-based clustering, an efficient geometric convex analysis scheme, and stability-based model order selection. The superior performance of CAM against a panel of benchmark BSS techniques is demonstrated on numerically mixed gene expression data of ovarian cancer subtypes. We apply CAM to dissect dynamic contrast-enhanced magnetic resonance imaging data taken from breast tumors and time-course microarray gene expression data derived from in-vivo muscle regeneration in mice, both producing biologically plausible decomposition results.

A is the unknown M × K mixing matrix, and S is the unknown K × N source data matrix containing K source signals with N dimensions.The fundamental objective of BSS is to estimate both the unknown mixing proportions and the source signals based only on the observed mixtures.
Over the past fifteen years, a variety of BSS techniques have been continuously reported and tested on synthetic and real data, including Independent Component Analysis (ICA) and its variants, which assume sources are mutually statistically independent or uncorrelated [5][6][7], and Nonnegative Matrix Factorization (NMF) and its variants, which assume mixing proportions and sources are non-negative [1,8,9].NMF is known to have non-unique solutions and can be trapped in a local optimum of its objective function [9].Efforts, such as the incorporation of sparsity constraints [8,9], have been made to obtain more well-posed problems under the NMF framework [4,[9][10][11].Some other extensions of NMF include relaxation on the signs of the matrix factorization.Semi-NMF allows the mixing matrix to have mixed signs and convex-NMF further requires column vectors in A to be convex combinations of data points in X [12].While these algorithms can usefully extract interesting patterns from mixture observations, they may prove inaccurate or even incorrect in the face of real-world BSS problems, where their pre-imposed assumptions may not be valid.In particular, many source signals are statistically dependent and may not be sparse [2,3].
Alternative BSS techniques exploit Well-Grounded Points (WGPs) in non-negative source patterns, i.e. points with very high values in one source relative to all other sources [3,4,13].Under the assumption of WGPs, column vectors of the mixing matrix can be estimated by identifying WGPs located at the corners of the mixture observation scatter simplex and, subsequently, the hidden source signals can be recovered.N-FINDR is one of the earliest methods based on WGPs and identifies WGPs by searching for the maximum-volume simplex formed by the data points [14].Vertex Component Analysis (VCA) implements a fast WGP detection scheme by iteratively projecting data onto a direction orthogonal to the subspace spanned by the WGPs already determined and selecting the data point corresponding to the most extreme projection as the next WGP [15].The maximum-volume strategy has also been applied in the signal space by nonnegative least-correlated component analysis for recovering well-grounded sources [4].For cases where WGPs are absent but nearly pure-source data points exist, a constrained NMF method considering both the reconstruction error and the minimization of the simplex volume determined by the estimated mixing matrix column vectors has been proposed [16].A post-processing framework on the results obtained by a WGP-based solution has been developed using either extra mixture data or reliable peak structures of source signals, also for the situations where WGPs are absent [17].
However, there are several potential limitations associated with these techniques.First, they usually lack a theoretical proof of model identifiability and solution optimality [3].Many methods adopt the strategy of identifying WGPs without a stringent mathematical framework showing its validity.Second, many methods can be applied only to the exact-determined and over-determined cases, where the number of mixtures is no less than the number of sources, but not to the under-determined case, where there are more sources than mixtures.Third, their solutions (including model selection) may be sensitive to noise and outliers in the data.Fourth, some methods do not allow negative elements in the mixing matrix, which limits their applicability.
Based on the realization that the observed pattern across signal indices at each data point can be expressed as a nonnegative combination of the column vectors of the mixing matrix [18], we propose a Convex Analysis of Mixtures (CAM) method to estimate the mixing proportions by explicitly identifying WGPs at the lateral edges of the clustered observation scatter plot.CAM is theoretically supported by a series of newly proved identifiability and optimality theorems.A sufficient and necessary condition is discovered for identifying the mixing matrix through edge detection in non-negative well-grounded BSS problems, which also serves as the foundation for CAM to be applied to the under-determined case, in addition to the exact-determined and over-determined cases.The optimality of the edge identification strategy is also proved for non-negative BSS problems, even when WGPs do not exist.
The CAM algorithm integrates a plug-in noise and outlier filtering scheme, an edge detection and geometric convex analysis algorithm, and a model selection scheme for applications on noisy real-world problems.We first design a sector-based clustering scheme, used to obtain an effective noise and outlier-reduced, clustered representation of the data.We then develop an efficient lateral edge detection and geometric convex analysis algorithm that identifies the WGPassociated clusters, whose center vectors are the estimates for the column vectors of the mixing matrix.The algorithm proceeds to estimate source signals by non-negative leastsquares fitting of the latent variable model to the observation data, where the number of hidden sources is detected using a stability analysis scheme.Importantly, CAM operates in the scatter space of mixture signals, and hence can address highdimensional data deconvolution where no observed mixture signal is actually a pure source signal in the signal space.This advantage is significant in that CAM can achieve its goal using only a small number of mixture samples.Moreover, CAM is more powerful in distinguishing between similar sources because it exploits mixing diversity in the scatter space rather than source diversity in the signal space.
We demonstrate the principle and feasibility of the CAM approach on realistic synthetic data involving images and microarray gene expression profiles, and experimentally compare the accuracy of parameter estimates obtained using CAM to the most relevant alternative techniques.We then use the algorithm to dissect dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) data taken from breast tumors, identifying vascular compartments with distinct pharmacokinetics and revealing intratumor vascular heterogeneity.We also apply CAM to time-course gene expression data derived from in-vivo muscle regeneration in mice, observing biologically plausible dynamic patterns of relevant biological processes with distinct kinetics and phenotype-specific gene expression patterns.Finally, extensions of the CAM algorithm and the relationships to other approaches are discussed.

A. Assumptions of the CAM Model
Considering the linear latent variable model X = AS, we can re-express the model in vector-matrix notation 1 where , , and are column vectors of matrices X, A, and S, respectively.Such a linear latent variable model is widely applicable to the analysis of many types of data, with the interpretation of the mixtures and underlying sources application-dependent.As a generic example for now, one can consider image unmixing, with M observed N-pixel images, each a mixture of K source images.
Our CAM model is developed based on the following assumptions.
(A1) (Source non-negativity) Every element in S takes a non-negative value and S has full row rank.
(A2) (Sources well-grounded) The source data matrix S contains at least one WGP on each of the K coordinate axes, i.e. , such that , , where is the standard basis of Kdimensional real space.
(A3) Every column vector in A is neither a non-negative nor a non-positive linear combination of other column vectors in A.
(A4) A is of full column rank, i.e. .
(A1) is widely satisfied in many real-world applications, such as image mixtures and mixtures of biochemical molecular quantities [2,3].(A2) exploits the high-contrast segments between source signals, which often exist in many real-world problems [3,4].One such example is unmixing of multispectral images in remote sensing, where WGPs correspond to "endmember" pixels.From (A1) and Equation (1), we have where is the kth element of .When the source matrix S satisfies (A1) and (A2), i.e. it is a non-negative well-grounded BSS problem, (A3) is a necessary and sufficient condition for the mixing matrix A to be identified, as we will prove through a set of theorems later.(A4) is a sufficient condition for identifying the source matrix S, when (A1) and (A2) hold, and is widely used in many BSS problems [5].Apparently, (A3) is a necessary but not a sufficient condition for (A4), in other words, (A3) is guaranteed to hold if (A4) is satisfied, but not vice versa.Also, importantly, (A3) can hold not only in the exact-determined and over-determined cases, but also in the under-determined cases, where there are at least three mixtures, i.e.
. (A4) on the contrary can be satisfied only in the exact-determined and over-determined cases.(A3) gives CAM the potential to be applied to the under-determined case.

B. Identifiability of the CAM Model
We now discuss the identifiability of the CAM model under the aforementioned assumptions via the following definitions and theorems (see formal proofs in Appendices A and B). ) and z can only be expressed as a trivial combination of (i.e. if for some q, then , ).See Fig. 1 for illustrations of a convex cone and its lateral edges.Because for edges only the vector direction is of interest, edges with the same vector direction but different lengths will be considered identical in the sequel.With the concept of convex cone, the model assumption (A3) can be , it is also a lateral edge of cone  .This means that, in principle, under a noise-free scenario, we can directly recover by locating the lateral edges of  , up to the ambiguity of positive scaling.If (A4) is satisfied, the source data matrix S can then be recovered by the generalized inverse of A, which is S under a noise-free scenario [5].If (A4) is not satisfied and only (A3) is satisfied, S might not be recoverable.
We summarize the identifiability of the CAM model as follows: (1) If (A1), (A2), and (A4) are satisfied, which can happen only in the exact-determined and over-determined cases, both A and S are identifiable.(2) If (A1), (A2), and (A3) are satisfied (which can happen not only in the exact-determined and over-determined cases, but also in the under-determined case where there are at least three mixtures), the mixing matrix A and the number of sources are identifiable.S is usually not identifiable if (A4) is not satisfied.

C. Detectability of the lateral edges of cone 
One key step in developing the CAM algorithm is efficient detection of the lateral edges of  .Here we discuss the algorithmic principle and optimality of our CAM solution via the following definition and theorems (see formal proofs in Appendices C and D). .We also define the angle between a non-zero vector and a zero vector to equal 8 ˚, i.e. 8 .See Fig. 1 for an illustration of projecting a data point onto a convex cone and the corresponding projection angle.The optimization problem in Equation ( 4) is a second order cone programming problem that can be solved by existing algorithms [19]., .Note that Theorem 2 assumes each data vector has a unique direction.This can be easily satisfied in practice by retaining in only one data vector from each group of vectors that are positive scalings of each other (i.e., which lie in the same direction).
An important consideration for the present method is that it requires a WGP to exist for each of the underlying sources.While this is both a reasonable assumption in practice and serves to establish mathematical identifiability of the CAM model, nevertheless in some datasets, WGPs may not exist.It would be helpful to provide an accurate interpretation of the CAM solution in such non-ideal scenarios.Accordingly we show that, even if WGPs do not exist, CAM edge detection provides the optimal solution in the sense of capturing maximum source information, because when WGPs are not present, the data vectors which instead achieve Maximum Source Dominance (MSD) for each of the sources are the lateral edges of  and will be identified by CAM.Specifically, we have: Theorem 3 (Source dominance optimality).Suppose that (A1) and (A4) hold.For each source k, , the CAM solution identifies at least one lateral edge, denoted by , achieving the maximum source dominance in the sense of where T , satisfying , is the source vector of sample n following a normalization operation applied to the observed data matrix.
Please see Appendix D for the proof of Theorem 3 and for the details of the normalization on the observed data matrix so that source vectors corresponding to different data points are comparable.

III. CAM ALGORITHM
So far, we have developed a mathematical CAM framework for separating non-negative well-grounded sources under an ideal noise-free situation.In this section, we develop a practical CAM algorithm that is based on this framework but which also robustly addresses the realistic scenarios where there may be both noise and outliers present.This algorithm consists of data preprocessing, sector-based clustering, convex analysis of mixtures, stability analysis, and source pattern recovery.We first summarize the steps of the CAM algorithm and then explain each of these steps in the following subsections.

CAM Algorithm
(1) Data preprocessing to normalize data and remove data points with small vector norms that potentially have low local SNR.
(2) Sector-based clustering on the scatter plot to get a noisereduced representation of the data.
(3) Convex analysis of mixtures for estimating the mixing matrix, including (i) edge detection based on sector central rays to form a candidate pool of estimates for the mixing matrix column vectors and (ii) minimization of model fitting error to produce an estimate for the mixing matrix with a given source number.
(4) Determination of source number by stability analysis, which repeats steps (2) and (3) for different source numbers based on random partitions of the data to calculate the normalized model instability of each candidate source number.The best source number is selected as the one with the smallest instability.

A. Data Preprocessing
Our algorithm begins with two data preprocessing steps.First, we scale the observed mixtures to have unit sums and assume the underlying sources also have unit sums as done in [13], i.e. after scaling, , , and , .Note that this scaling makes each row of A have unit sum so that the mixing matrix elements provide the mixing proportions, i.e. , , where is the mth element of .Note that this step removes the scale ambiguity of the BSS solution [6].Second, consider the following noisy linear latent variable model where is the additive noise on sample n and is independent of .We assume that and define the Signal-to-Noise Ratio (SNR) of the whole dataset as Since the expected noise level for all data points is the same, data points with small vector norms are expected to have a lower local SNR, which could have a negative impact on subsequent analysis [3], so the second step of data preprocessing is to remove these small norm points.

B. Noise or Outlier Removal by Sector-based Clustering
The purpose of sector-based clustering on the preprocessed data points is two-fold: 1) data clustering has proven to be an r model learning [3,20]; and 2) aggregation of data points into a (smaller) number of clusters improves the computational efficiency of subsequent convex analysis of mixtures by reducing the number of tests performed for identifying lateral edges.After sector-based clustering, each data sector (cluster) is represented by a ray starting from the origin, which is called a sector central ray.Please see Fig. 2 for illustration of sectorbased clustering.The distance between a data point and a ray is the minimum distance between the data point and any point on the ray.Sector-based clustering groups data points into sectors (each with its own central ray) so that data points within a sector have more similar orientations (evaluated by their angles made with the central ray) compared to data points in other sectors.Assuming a sufficient number of sectors are used to model the data, we logically impose , , , where J is the number of data sectors and denotes the jth data sector.Since only the vector direction is of importance, the sector central rays are confined to have unit norm, i.e. , .Based on Definition 4, the sector central ray is mathematically defined as By expanding the square in the summation and simplifying, we can show that where is (the sample-based estimate of) the autocorrelation matrix of data vectors in .The solution of Equation ( 8) is the principal eigenvector of .

Sector-based Data Clustering Algorithm
(1) Randomly initialize each of the J sector central rays to one of the observation data points and unit-normalize these vectors.
(2) Partition the observed data points into J data sectors by assigning each data point to its nearest sector based on the distance between the data vector and the sector central ray r, calculated by .
(3) Update the J sector central rays by finding the principal eigenvector of each of the sample-based correlation matrices , , determined by the data partition in step (2).
(4) Terminate if there is no change in the total clustering distortion shown in Equation ( 9), from the previous to the current iteration; otherwise, go to step (2).
The sector-based clustering algorithm monotonically descends in the clustering distortion where is the matrix composed of sector central rays and is the partition of data points into J data sectors.It also terminates in a finite number of iterations at a fixed point solution that is a local minimum of Equation ( 9), which can be proved following the standard convergence proof of the generalized Lloyd algorithm [21,22].The computational complexity of this algorithm is dominated by the partitioning step, whose complexity is O(JMNI),where I is the number of algorithm iterations.Random initialization of the sector central rays can affect the local optimum to which the algorithm converges; thus, in practice, the algorithm is usually run multiple times, with the sector partition with the minimum clustering distortion chosen as the final outcome.

C. Convex Analysis of Mixtures
At this juncture, having performed sector-based clustering, we have R as a noise/outlier mitigated representation of the data matrix X. Accordingly, supported by Theorem 1, which says that in the noise-free case, the columns of A are the lateral edges of  , it is reasonable, in the noise-mitigated case, to estimate the columns of A based on the lateral edges of the cone  .CAM uses the following algorithm specifically designed based on Theorem 2 to detect the lateral edges of cone  .

Cone Lateral Edge Detection Algorithm
(1) Set , , and (or another small positive value); Set .
(2) Determine projection image  by projecting onto cone  , where is the matrix resulting from removing the jth column from ; (3) If  , ; otherwise, remove , i.e. the jth column, from and ; (4) If , end the algorithm; otherwise, go to step (2).
The worst-case computational complexity of the cone lateral edge detection algorithm is O .After applying the algorithm, the column vectors in are the detected edges.The detected edge number has some dependence on the sector-based clustering solution, including the chosen number of sectors, J. Clearly, one must choose .In practice, to ensure this, one may choose J fairly large, in which case there are usually more than K detected edges.Regardless, the detected edges are good candidates, from which to select a subset as estimates of .To identify good, refined estimates of from this candidate pool, a combinatorial search based on a model fitting error criterion can be performed to identify the most promising K lateral edges.Specifically, let be any size-K subset of .The K lateral edges with sector indices that minimize a model fitting error are chosen, as follows: is the projection of onto cone  and is the number of data points in sector j.Because the an l b w h " " l y central rays confined within  , and their projections on  are all 0, the model fitting error is a weighted sum of h l b w h " " l y and their projections, and the weights are the data sector population sizes.Because this model fitting error is monotonically decreasing as the edge set under consideration enlarges, the search for the best K lateral edges can be accelerated by using the branch and bound search algorithm [23], which guarantees finding the edge set minimizing the model fitting error without the need for exhaustive search.The average complexity of branch and bound search is no larger than O , where is a constant that is problem-dependent [41].
The edge set minimizing the model fitting error forms the estimate of the mixing matrix, which we denote by .We then project all the mixture data vectors in X onto the cone  and compose these projected vectors into a matrix,  .This projection step ensures that our estimates for the sources will be non-negative and also helps to suppress noise existing in X.

If
has full column rank, the estimates of sources are calculated via the generalized inverse of , i.e.


, which is the projection of X onto the cone  , it can be shown that , where denotes the Frobenius norm of a matrix and denotes the set of K by N non-negative matrices.Thus, is actually a non-negative least squares estimate.
In summary, we note that CAM entails four sub-algorithms that involve minimizing an objective function: 1) sector-based clustering; 2) cone lateral edge detection; 3) estimation of the mixing matrix column vectors through minimization of model fitting error; and 4) stability-based source number estimation.For three of these algorithms, the globally optimal solution that minimizes the given objective function is found.Only the sector-based clustering is subject to finding (potentially poor) locally optimal solutions; however, this is substantially mitigated by performing the clustering multiple times from different random initializations and picking the best solution.

D. Detection of Source Number by Stability Analysis
One important CAM issue is detection of the structural parameter K (the number of underlying sources), often called model selection.This is indeed particularly critical in realworld applications where the true structure of the latent variable model may be unknown a priori.We propose to use a stability analysis scheme to guide model selection, based on a carefully designed model instability index.
Similar to the rationale in determining the number of clusters in data clustering using stability analysis [24], the basic principle is that, if K is too large, some extracted sources will simply model random noise patterns; on the other hand, if K is too small, some extracted sources will be arbitrary combinations of true sources; both scenarios produce unstable models.Stability analysis assesses the model instability indices associated with different values of K, calculated based on a large number of 2-fold cross-validations, and selects the model order with lowest model instability.In each crossvalidation trial l , the preprocessed observation data are randomly divided into two folds (indexed by and ) of equal size; then, CAM is applied on both folds and produces two independent estimates of the mixing matrix, denoted as and , respectively, for , where is the maximum source number under consideration.We then define the Normalized Model Instability (NMI) index as where and are estimates of the mixing matrix formed by randomly selecting K sector central rays from the sector-based clustering result obtained on data folds 1 and 2 in the lth cross-validation, respectively, and where here denotes the minimum average angle between the column vectors of two input matrices.To explicate this averaging, let the two input matrices be U and W . is calculated as where is the set including all permutations of and is the kth element in a permutation .Since the association between column vectors in U and W is not known, we need to search through all possible associations to find the optimal one.Using the Hungarian method, the complexity of this search is O [25].The definition in Equation ( 11) produces an NMI index that is y p h " l z " automatically adjusts the NMI index for comparison across different model orders as adopted by [24].

IV. EXPERIMENTS AND RESULTS
In this section, we first validate the CAM principle on synthetic data and numerically mixed image data, and then show its superior performance against a panel of benchmark BSS techniques on numerically mixed gene expression data under various parameter settings.We evaluate the algorithm performance by comparing the estimates of the mixing matrix and sources to the ground truth, together with the accuracy of source number estimation measured over a number of data set replications.We apply the minimum average angle defined in Equation ( 12) to assess the accuracy in estimating the true mixing matrix A, via where is the estimate of A.
takes a value between 0 and 1, with indicating perfect estimation.The calculation of minimum average angle produces an association between the column vectors in and the column vectors in A, which also indicates the association between estimated sources and ground truth sources.To assess the accuracy of source recovery, we use the average correlation coefficient between true sources and their estimates, i.e.
where is the estimate of the kth source that is the kth row of source matrix S, and denotes the correlation coefficient between them.After validating the CAM principle and showing its superior performance on realistic simulation data, we proceed to apply CAM to real-world scientific problems.Specifically, to dissect DCE-MRI data and real microarray gene expression data, where we evaluate the obtained results against existing biological knowledge.

A. Demonstration of CAM on Synthetic Data and Numerically Mixed Image Data
To illustrate CAM, we first consider a simulated data set consisting of data points.Half of the source vectors are drawn from a three-dimensional exponential distribution with independent variables to ensure the existence of approximate WGPs.The other half are first drawn from a three-dimensional Gaussian distribution with correlated variables to ensure source dependence and then absolute values are taken to force source non-negativity.The mixing matrix, source mean vectors, and covariance matrix are given as .
The additive noise is drawn from a Gaussian distribution with .
The structure of this data set has been chosen in order to illustrate the noisy and strongly correlated nature of many real data sets.The dataset has an SNR of 12.4dB, calculated by Equation (6).
After data preprocessing, we kept the 800 data points whose vector norms are largest and performed sector-based clustering on these data points 20 times with , selecting the best clustering outcome measured by the total clustering distortion given in Equation ( 9).On the sector central rays obtained from the best clustering outcome, we performed the cone lateral edge detection algorithm and then identified the three edges that minimized the model fitting error according to Equation (10) to form the estimate of the mixing matrix.The sources were recovered using the mixing matrix estimate accordingly.The resulting recovery accuracies and are 0.9826 and 0.9171, respectively.Fig. 3 shows the 800 large-norm data points (black dots), the data sectors and central rays obtained from the best clustering outcome, the detected cone lateral edges (red circles), the three edges producing minimal model fitting error (red circles pointed by arrows), and the ground truth mixing matrix column vectors (blue diamonds).We also applied stability analysis with 30 cross-validations, and obtained NMI indices that show a minimum value at (see Table 1 for NMI indices of different model orders), which agrees with the ground truth.The power of the CAM approach is supported here as both the mixing matrix and hidden sources are well recovered and the number of hidden sources is correctly identified.
As an example of a realistic problem, we consider data sets containing numerical mixtures of images.Three ( ) images (see Fig. 4a) were mixed to produce observation images (see Fig. 4b and Fig. 4d) in both exactdetermined ( ) and over-determined ( ) scenarios, where the randomly generated mixing matrices are  4c and Fig. 4e for the exact-determined and over-determined scenarios, respectively.and are 0.9781 and 0.9926, respectively, in the exact-determined scenario.In the over-determined scenario, and are 0.9761 and 0.9832, respectively.From Table 1, we can see that the stability analysis based model order selection detects the correct number of source images (i.e. 3) in both the exactdetermined and over-determined scenarios.
As a more complex problem, we considered numerical mixtures of four real microarray gene expression profiles ( ), which are from four distinct ovarian cancer subtypes, i.e. serous, mucinous, endometrioid, and clear cell [26].The sample labels of the gene expression profiles serving as sources are CHTN-OS-115, UM-OM-001, CHTN-OE-047, and CHTN-OC-033 [26].The sources contain expression  levels of genes, some of which are approximately WGPs.The source profiles are highly correlated, with an average pair-wise correlation coefficient of 0.83; also, the source vectors of many genes have very small vector norms.To enable applicability of the NMF methods, we limited mixing matrices to be non-negative.We consider exactdetermined ( ), over-determined ( ), and under-determined ( ) scenarios, 100 randomly constructed mixing matrices for each scenario, and 6 different SNR levels based on zero-mean white Gaussian additive noise.The mixing matrices are required to have unit row-sums.In the exact-determined and over-determined scenarios, they have a condition number 4, so that (A4) holds well.In the under-determined scenario, they satisfy that ,  to ensure that (A3) holds well, where  is the projection of on  .To enable the applicability of NMF and SNMF, all observed negative values in data were truncated to 0. In total, there are 1,800 simulation data sets.
For CAM, we set the sector numbers and , with the results indexed by CAM-20S and CAM-30S, respectively.Data preprocessing removed half of the data points with small vector norms.The sector-based clustering always chose the best outcome from 20 independent runs.Stability analysis used 30 cross-validations.We calculated the performance measures for recovering the mixing matrix and the whole gene expression source profiles.More importantly, we also calculated source recovery accuracy over the top source-specific genes --800 genes for each ovarian cancer subtype, selected to maximize , .The distinct source patterns over these genes that are highly expressed in a specific ovarian cancer subtype are of great interest in biological study [27].
When evaluating the accuracies of recovering the mixing matrix, sources, and distinct patterns of sources, the number of sources ( ) was assumed known and used as an input parameter for all the algorithms.All mixture gene expression profiles were normalized by scaling to have a unit sum before applying CAM and other methods.Principal Component Analysis (PCA) was used to convert an over-determined case to an exact-determined case when applying nICA, SNICA, and N-FINDR in the over-determined experimental scenario [4,5], because these methods can only work in the exactdetermined case.Random initialization was used for setting the initial algorithm parameters needed to run the competing methods.NMF used the multiplicative update rule proposed in [40].SNMF used the multiplicative update rule proposed in [8], with the source sparseness and model fitting error equally weighted in its objective function.NMF and SNMF terminated when the absolute changes of their objective function values were no larger than 0.0001% or when their numbers of interactions exceeded 5000.The iterative gradient search algorithm of nICA terminated when the mean squared error or its absolute change is smaller than 1×10 9 or when the number of interactions exceeded 5000 [6].SNICA used a simulated annealing algorithm based on constrained Metropolis-type Monte Carlo search to minimize the mutual information between recovered sources [7].In the initial stage, the Metropolis temperature parameter was set at 0.01, and in the refine stage it was set at 1×10 6 .The algorithm terminated when the minimum mutual information obtained during the entire run did not decrease in 200 successive Monte Carlo steps.The VCA algorithm requires the SNR to either be estimated or to be input to the algorithm.We found that VCA performance was very poor when the algorithm used its own internal estimation of the SNR.Thus, in our experiments we input the SNR as 100 dB, which basically indicates the data is almost noise-free.This gave more reasonable VCA performance.
Fig. 5 shows the performance results in the exactdetermined and over-determined scenarios, when the correct number of sources is given.The estimation accuracies on the mixing matrix, whole hidden sources, and distinct source patterns, are averages over 100 simulation datasets.It can be seen that both CAM-20S and CAM-30S outperform all five peer methods in all cases, and most importantly, they consistently achieve higher accuracy in recovering the distinct source patterns.It should be noted that the use of an overall correlation coefficient in assessing the estimation accuracy of sources may be misleading when the underlying sources are already highly correlated, and the correlation coefficient calculated over the distinct source patterns should be a more meaningful accuracy measure [3].nICA consistently produced the overall worst unmixing performance among peer methods, which confirms the infeasibility of ICA based approaches to solve BSS problems when the sources are correlated.In all circumstances NMF and SNMF consistently produced similar results.Logically, if the sources are not globally and sufficiently sparse, an insignificant difference between the performances of SNMF and NMF is expected.Though VCA and N-FINDR also exploit the idea of well-grounded sources, they are very sensitive to noise or outliers and thus produce unsatisfactory performance compared to CAM.
To assess the performance of stability based model selection (a unique feature of CAM) at each SNR level, we measured the frequency with which CAM correctly detected the number of sources, over the 100 simulation datasets.Fig. 6 shows this accuracy at different SNR levels in exactdetermined and over-determined scenarios.For both CAM-30S and CAM-20S, the number of sources ( ) was always accurately detected for SNR higher than 25dB in exact-determined and over-determined scenarios.At lower SNR levels, CAM-20S shows a more robust performance against noise than CAM-30S.Fig. 7a shows that CAM can recover the mixing matrix reasonably well over the entire tested SNR range in the underdetermined scenario, when the number of sources is given.Fig. 7b shows the accuracy of model order selection in the under-determined scenario, indicating that when the SNR level is higher than 25dB both CAM-20S and CAM-30S detect the correct source number (i.e. 3) on more than 80% of the datasets.In both Fig. 7a and Fig. 7b, some slight performance drop is observed in the under-determined scenario when the SNR is increased toward its high end, possibly due to over-compensation for the noise by the clustering scheme when the noise level is low.To compare computational complexity of the methods, we recorded the execution times of all methods, analyzing the 100 datasets for an SNR of 22dB on a computer with a 1.60GHz CPU.The analyses were run with the true number of sources known and with the parameter setting as described above.All methods were implemented in Matlab for a fair comparison, expect for SNICA, which was implemented in C. The mean and standard deviation of execution times in seconds are (a) (b) Fig. 6.Model order selection accuracy of CAM.(a) and (b) are the model order selection accuracies obtained in the exact-determined and overdetermined scenarios, respectively.The model order selection accuracy of CAM-30S at 19dB is 97% and not drawn in (a), because it is misleading.At 19dB, some of the estimates of mixing matrix obtained by CAM-30S tend to be a permutation and scaling matrix, which indicates poor unmixing.Without effective unmixing, the mixture data dimension is mistaken as the estimated source number that equals the true source number in the exact-determined case, which gives rise to the misleading high model order selection accuracy.presented in Table II.VCA is the fastest among all methods, followed by nICA, and then SNMF, N-FINDR and NMF.CAM is slower than these methods, but faster than SNICA, which uses Monte Carlo stochastic search and is the slowest among all competing methods, even with its implementation in C. CAM-30S is slower than CAM-20S as expected, because sector-based clustering takes more time when there are more sectors and the estimation of mixing matrix column vectors through minimization of model fitting error may also take more time due to possibly a larger number of detected edges.We also ran the CAM algorithm with stability analysis based model order selection using 30 cross-validation trials to determine the source number.The mean and standard deviation of execution time for CAM-20S with model order selection are 585.05seconds and 104.92 seconds, respectively.The mean and standard deviation of execution time for CAM-30S with model order selection are 1019.00seconds and 161.03 seconds, respectively.

C. Analysis of Breast Cancer DCE-MRI Data
As an example of using CAM for real-world application, we considered DCE-MRI data from breast cancer to evaluate tumor vasculature patterns [3].The data include MRI images of breast tumors taken at sequential time points after the injection of molecular contrast agent into the blood.Due to intratumor heterogeneity and limited imaging resolution, the concentrations of the contrast agent at many image pixels often represent a mixture of more than one vascular compartment, each with distinct and characteristic perfusion and permeability.The existence of near-pure compartment pixels allows us to use CAM to identify distinct vascular compartments and their spatial distributions within a tumor.
The DCE-MRI dataset includes image frames of a breast tumor (see Fig. 8a) taken every 30 seconds, starting from 90 seconds after injection of the molecular contrast agent.Each image contains pixels, and after masking out the non-tumor region, the resulting image contains pixels for CAM analysis.Noise filtering removed 30% of the pixels whose vector norms were small.The sector-based clustering chose the best clustering outcome in 20 independent runs, with cluster number .We performed stability analysis via 30 cross-validations, which suggested the compartment number , as summarized in Table 1.
CAM analysis indicates three compartments, i.e. fast-flow, slow-flow, and plasma input [28], characterized by their pharmacokinetics patterns.Fig. 8b shows the dynamic changes of tracer concentration of the three compartments, which are the column vectors in the recovered mixing matrix (with some proper rescaling) [3].Fig. 8c shows the spatial distributions of the identified compartments, which correspond to the recovered sources .
The fast-flow compartment has a fast tracer clearance rate F 8b h p ph l " " f h (see Fig. 8c).The slow-flow compartment shows very slow tracer kinetics (see Fig. 8b) and dominates the " " f the tumor (see Fig. 8c).The identification of fast-flow and slow-flow pools is plausibly consistent with previously reported intratumor heterogeneity [29,30].The defective endothelial barrier function of tumor vessels results in spatially heterogeneous high microvascular permeability to macromolecules [29,30].It has been reported that the p ph l " " f v b f h v v angiogenesis that is essential to tumor development [29].This rapidly proliferating neovasculature is often abnormal, and forms leaky and chaotic vessels, giving rise to a rapid tracer uptake and washout pattern, forming the fast-flow pool [30].O h h h h " " f h h significantly lower blood flow and oxygen concentration because the tumor growth requires a large portion of its blood supply and also neovessel maturation, forming the slow-flow pool with much slower tracer accumulation and washout [30].

D. Analysis of Muscle Regeneration Time-Course Gene Expressions
As a final example, we applied CAM to dissect a timecourse gene expression dataset obtained from a mouse skeletal muscle regeneration process [31].Skeletal muscle regeneration is a highly synchronized process involving the activation of various cellular processes.Cells grow in dynamically evolving subpopulations, yet the dynamics and proportions of cell subpopulations often go unmeasured on the basis of their mRNA expression patterns [32].Within a mixed population of cells, one might expect distinct cell types to exhibit some distinct patterns of gene expression, and the measured mRNA levels in the mixed cell population represent a weighted average of these hidden biological processes, where the weights are cell proportions involved in different biological processes.Here, we ask whether it is possible to deconvolve the gene expression data from a mixed cell population to discern the proportions of different cell types, by treating specific mRNA patterns as cell-type specific markers [32].
The time-course muscle regeneration gene expression data were acquired at successive time points using microarrays after the injection of cardiotoxin into the mouse muscle, which damages the muscle tissue and induces staged muscle regeneration [31].Standard preprocessing suggested reliably expressed genes for subsequent CAM analysis [31].Noise filtering removed 40% of the genes whose vector norms were small.The sector-based clustering chose the best clustering outcome in 20 independent runs, with cluster number .We performed stability analysis via 30 cross-validations, which suggested as the number of potentially distinct sources associated with underlying active biological processes, as summarized in Table 1.
Fig. 9 displays the source-specific time activity curves (the column vectors of the estimated mixing matrix) that represent the proportions of cell subpopulations associated with the 4 underlying putative biological processes at each time point.For each of the identified sources, we selected 200 sourcespecific genes (near-WGPs) that maximize , , to define source-specific distinct patterns [27].We input the four source-specific gene groups into Ingenuity Pathway Analysis (IPA), a comprehensive database of gene annotations and functions that performs Fisher's exact test to assess the association of a given gene set with known biological functions, with p-values indicating the significance level.Functional analysis by IPA consistently suggests the biological plausibility of all four biological processes revealed by CAM.Specifically, IPA suggests that source 1 is associated with inflammation, connective tissue disorders, skeletal and muscular disorders, and immune response, with p-values of 6.77E-39, 9.02E-35, 9.02E-35, and 9.62E-32, respectively.The corresponding genes are heavily involved in the necrosis of damaged muscle tissue and the activation of an inflammatory response.In Fig. 9, it can be seen that source 1 activates immediately after muscle damage and then diminishes quickly, reflecting the fact that necrosis and inflammatory response constitute the first transient phase of muscle regeneration [33].IPA suggests that source 2 is associated with three biological functions, i.e. (1) cell cycle, (2) DNA replication, recombination, and repair, and 3) cellular growth and proliferation, with p-values of 7.07E-25, 3.77E-17, and 2.10E-8, respectively.The associated genes are actively involved in myogenic cell proliferation to prepare sufficient myoblasts for later differentiation.The source 2 activity reaches its peak(s) from day 2 to day 4 as biologically expected (see Fig. 9) [33].IPA suggests that source 3 is associated with tissue development, skeletal and muscular system development, cell to cell signaling and interaction, and connective tissue development and function, with p-values of 9.09E-16, 4.91E-11, 2.33E-08, and 4.35E-07, respectively.The corresponding genes are expected to facilitate the differentiation of myoblast into mononucleated myocyte and the fusion of myocytes to form multinucleated myofibers.As expected, in Fig. 9 the source 3 activity goes up after sufficient myoblasts are produced by the activity of source 2, keeps at a high level from day 5 to day 13, and then goes down.Such a trend is consistent with the widely observed fact that muscle regeneration is accomplished in approximately two weeks [33].IPA suggests that source 4 is associated with skeletal muscular system function and tissue morphology, with a pvalue of 3.49E-10.The corresponding genes are typically active in normal muscle cells, whose activity drops dramatically after muscle is damaged and gradually recovers until it finally reaches a similar level of original muscular activity as at day 0 (see Fig. 9).

V. CONCLUSION AND DISCUSSION
We have presented a novel approach to separate nonnegative well-grounded sources from observed mixtures, which is geometrically principled and which, as illustrated by real examples, can be very effective at revealing hidden sources within data.It is worth noting that there are four novel features associated with CAM.First, we show both feasibility and optimality of CAM regarding the existence of WGPs and source-dominant points, via newly proved theorems.We prove for the first time a sufficient and necessary condition for identifying the mixing matrix in non-negative well-grounded BSS problems through edge detection.We also show the optimality of the edge detection strategy that identifies the data points with maximum source dominance.Second, we propose an effective noise and outlier removal scheme based on sector-based clustering and an efficient lateral edge detection method on the clustered data scatter plot.Third, based on the proposed identifiability condition of the mixing matrix, the CAM methodology, including the edge detection method and the stability-based model order selection, can be uniformly applied to the exact-determined, over-determined, and under-determined cases, which enables CAM to identify an under-determined problem when encountering such a case.Fourth, we apply CAM to real gene expression data and DCE-MRI data and validate the results against well-established scientific knowledge.
There are important differences between CAM and existing methods for separating non-negative well-grounded sources, such as N-FINDR and VCA [3,14,15].At the front end, these existing methods usually assume the source number is known and apply dimension reduction methods, such as PCA, and the normalization scheme specified in Appendix D on the observed mixture data, so that data points form a simplex in the dimension-reduced space with WGPs being the simplex vertices.The differences between CAM and these simplexbased methods are two-fold.First, CAM does not require prior knowledge of the source number, and uses edge detection and its associated stability-based model order selection to identify the source number.The proposed method can be applied to the exact-determined, over-determined, and under-determined problems, as mentioned above, which enlarges the application range of CAM.Second, CAM is solely based on convex cone and edge detection, which does not require dimension reduction.Without prior knowledge or accurate estimation of the source number, dimension reduction may over-reduce the data dimensionality, with the grave consequence of transforming an exact-determined or over-determined problem into an under-determined one, for which the sources are usually unidentifiable.
A convex model for NMF has been developed in [20], which adopts the source well-groundedness assumption and also uses a clustering method to reduce data points and noise.Compared to this method, CAM has greater applicability by allowing the mixing matrix to have negative elements.Another work related to CAM is [34], which identifies WGPs by examining each observed data point to see whether it is confined within the cone formed by other data points, although the authors do not explicitly formalize their method as an edge detection scheme.Our work enhances the edge detection strategy for separating non-negative well-grounded sources by proving its optimality and identifiability condition.For identifying the mixing matrix in the under-determined case, sparse component analysis proposed in [35] assumes that there are on average k sources contributing to each data point and that k is known a priori.[35] uses partial k-dimensional subspace clustering to recover the mixing matrix, which can be viewed as an extension of the edge detection strategy from one-source dominant WGPs to k-source dominant data points spanning subspaces.CAM allows the mixing matrix to have mixed signs.By requiring both the mixing matrix and sources to be non-negative, methods have been developed to separate well-grounded sources under the NMF framework [36,37].The method proposed in [36] identifies WGPs one-by-one by detecting the extreme ray farthest away from the cone formed by the WGPs that have already been detected.[37] proposed a scalable and efficient method for solving problems where .While CAM is fitting one convex cone to the data, [38] proposed to model the data with multiple small convex cones to accommodate manifold structure in the source signals.
Both probabilistic methods and deterministic methods have been used to solve BSS problems, and there is usually a connection between the two kinds of methods [39].The proposed CAM method is largely a deterministic approach.It is an interesting topic to build a probabilistic model for separating non-negative well-grounded sources.We are currently investigating a probabilistic CAM model that combines geometric convex analysis with probabilistic modeling.Within a probabilistic modelling framework, information-theoretic criteria, such as minimum description length [3], can be used for model selection to determine the source number.Besides the applications for analyzing genomic data and images, CAM can also be applied to many other analyses, such as document topic modeling [39].

An
open-source platform-independent software implementation of the CAM algorithm in R and Java is available from: http://www.cbil.ece.vt.edu/software.htm.

A. Proof of Lemma 1
First, we prove that (A3) is a sufficient condition.Suppose that (A3) holds. ,

Definition 1 .( 3 ) 2 .
Given a matrix B composed by its set of column vectors , the convex cone determined by is  Definition A non-zero vector z is a lateral edge of  , if  (i.e. ,

Fig. 1 .
Fig. 1.Illustration of a convex cone  with three edges in three dimensional space.Lines with an arrow are the axes.Bold lines are edges , and .The cross-section of convex cone  is a triangle, indicated by grey color.The star markers on the edges are well-grounded points.v is a point outside of  .Its projection on  is  . is the projection angle.

Fig. 2 .Definition 4 .
Fig. 2. Illustration of sector-based clustering in a three-dimensional scatter plot.Four sources (K = 4) are mixed to form three mixtures (M = 3).Small circles are data points.After clustering, each data sector is represented by a sector central ray (solid lines).Four data sectors are close to or on the true edges of cone  , with their sector central rays indicated by bold lines.The quadrilateral formed by the dashed lines indicate the intersection of the cone.

Fig. 3 .
Fig. 3. Perspective projection of the 800 large-norm data points in the toy dataset onto the 2-D intersection of the convex cone formed by the data points.Perspective projection performs simple positive scaling of data points to make every data point have unit element sum.Black dots are data points.Each data point is connected to its sector central ray by a line.Red circles indicate the edges detected by applying the lateral edge detection algorithm on the sector central rays.Blue diamond markers indicate the positions of mixing matrix column vectors.The three edges that minimize the model fitting error among all three-edge sets are indicated by arrows.
performed CAM on the mixture image data.Sector-based clustering was run 20 times to select the best clustering outcome, with the sector number set to 70.Stability analysis used 30 cross-validations to detect the number of sources.The source images were quite well recovered and are shown in Fig.

Fig. 4 .
Fig. 4. Separation results of numerically mixed images.(a) Three source images.(b) Three mixture images obtained by mixing the source images in the exactdetermined scenario.(c) Recovery result produced by applying CAM to the mixture images in (b).(d) Five mixture images obtained by mixing the source images in (a) in the over-determined scenario.(e) Recovery result produced by applying CAM to the mixture images in (d).

Fig. 5 .
Performance comparison of CAM and peer methods.(a) and (d) are comparisons on accuracy of recovering the mixing matrix in the exact-determined scenario and over-determined scenario, respectively.(b) and (e) are comparisons on accuracy of recovering sources in the exact-determined scenario and overdetermined scenario, respectively.(c) and (f) are comparisons on the accuracy of recovering distinct patterns of sources in the exact-determined scenario and over-determined scenario, respectively.

Fig. 7 .
The performance of CAM on recovering (a) the mixing matrix and (b) the source number in the under-determined scenario.

Fig. 8 .
Fig. 8. CAM analysis result on breast cancer DCE-MRI data.(a) MRI images of a breast tumor taken at sequential time points after the injection of molecular contrast agent into blood.(b) Tracer concentration changes of the three identified compartments over time.(c) Recovered source images of the three compartments.

Fig. 9 .
Fig. 9. Time activity curves of the four sources detected on the 27 time-point skeletal muscle regeneration gene expression dataset.
It states that for separating non-negative well-grounded sources, (A3) is a sufficient and necessary condition for an edge detection solution uniquely identifying the mixing matrix A based on the observed data X.The lateral edges of cone