Progeny Clustering: A Method to Identify Biological Phenotypes

Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient in computing, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown successful and robust when applied to two synthetic datasets (datasets of two-dimensions and ten-dimensions containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and Rat CNS dataset) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset.

Here, d ij denotes the dissimilarity between any of the two objects in the same cluster measured by Euclidean distance, and n r is the number of objects in cluster C r . The log W K is then compared with its expectation through the definition of Gap as The expectation of log W K is estimated by uniformly drawing B reference datasets either (a) over the range of each feature or (b) over a box aligned with principal components of the data. In our testing, the former prescription was implemented and 50 reference datasets were generated for each dataset. Let sd K be the standard deviation and The optimal number of clusters k optimal is chosen via The Gap curves on all the datasets are illustrated in Figure S1, and the optimal number of clusters chosen from each curve is listed in Table 2.

Silhouette
The Silhouette score [2] is defined as Here, a (i) is the average dissimilarity of i to all other members in the same cluster, and b (i) is the minimum of the average dissimilarity of i to members in another cluster. We used the default Silhouette function in Matlab to compute the score, in which dissimilarity is measured by Euclidean distance. The average Silhouette score was then taken as the quality measure for each number of clusters. The results from Silhouette for all the datasets are summarized in Table 2, and the Silhouette score curves are shown in Figure S2.

Clest
The Clest algorithm [3] in brief follows the four general steps: (1) randomly splits the original dataset into a learning set and a test set; (2) builds a classifier by clustering the learning set; (3) applies both the classifier and clustering to the test set; (4) compares the two cluster assignments and outputs an external index t K as the similarity statistic. Similar to Gap Statistics, reference datasets are also generated to estimate the expectation of the similarity statistic. The optimal number of clusters is defined as in which d K denotes the difference between the observed similarity statistic and its expected value, and p K denotes the proportion of the reference similarity statistics that is greater than the observed value. The parameter values used for implementing Clest are listed in Table S1. The d K curve for each dataset is shown in Figure S3, and the optimal number of clusters is shown in Table 2.

Consensus Clustering
The key component of consensus clustering [4] is the construction of consensus matrix, which is similar to the co-occurrence matrix in our method. The method repetitively resamples and clusters the dataset. The cluster assignments in different perturbed datasets are recorded in connectivity matrices, the sum of which is normalized by resampling frequencies, resulting in a final consensus matrix. To choose the optimal number of clusters, the method uses consensus distribution to assess how the entries of the consensus matrix are distributed within the range from 0 to 1. A clean matrix, representing a good clustering, would have a consensus distribution concentrated at 0 and 1. The empirical cumulative distribution (CDF) is defined as where M (i,j) denotes the entry (i, j) of the consensus matrix M , N denotes the number of rows in M , and 1{. . .} is an indicator function. The optimal number K is chosen based on the highest (K) defined as  in which A(K) is the area under the CDF for k-cluster partition of the data. For implementation, we resampled the data 100 times to construct the consensus matrix. Both the CDF and (K) curves were plotted ( Figure S4 and S5), and the optimal number K for each dataset is shown in Table 2.

Model Explorer
The Model Explorer method [5] first generates two sub-samples (L 1 , L 2 ) with a ratio f of the original dataset, f ∈ (0.5, 1). The two sub-samples are then clustered separately, and the cluster assignments are compared at the intersection of the two sets by a specified similarity measure. Similar to Consensus Clustering, the optimal number of clusters is chosen based on the empirical cumulative distribution of the similarity measure. Here, we chose 0.8 for f , and used the Jaccard coefficient as the similarity measure, which is defined as The C 1 and C 2 are co-occurrence matrices for the intersection under two cluster assignments, and <, > represents the dot product of two matrices. The procedure was repeated 100 times, and the cumulative distribution curves of similarity were plotted ( Figure S6) for finding the optimal K ( Table  2).

Cell Phenotype Dataset
The cell phenotype dataset is derived from high content image analysis of 444 microscope images of single primary human umbilical vein endothelial cells. Four of these images were cells of interest (COI), that were cultured on fibronectin-coated glass coverslips, fixed, fluorescently immunolabeled for the cytoskeletal proteins vinculin and actin, and stained by DAPI for nuclei. The remaining cells were cultured on pattern configurations derived from the 4 COIs which were created via Image Guided Laser Scanning Lithography (IG-LSL). The process of patterning via LSL is described in detail elsewhere [6]. Briefly, it consists of four steps: (1) the image of the COI is processed to create a virtual mask, that defines regions of interest (ROIs) to control the laser position and shutter during Laser Scanning Lithography; (2) a focused 532 nm laser is used to photothermally desorb ROIs of an oligo(ethylene glycol) (OEG) terminated alkanethiol from a platinum-coated glass coverslip; (3) the surface is exposed to a solution of fibronectin that adsorbs exclusively to the bare platinum patterns where the OEG was removed; and (4) endothelial cells are seeded onto the surfaces. In total, four main categories of cells were patterned from the four original COIs, containing 102 crescent-shaped cells, 123 star-shaped cells, 179 square-shaped cells, and 40 Texas-shaped cells (Figure 4). There were also two subcategories of cells under each main category ( Figure S7) generated by different virtual masks from each COI: one used a single, continuous pattern derived from an outline of the COI, and the other used a discontinuous pattern composed of many individual features derived from an image of each COIs vinculin containing adhesion sites.
To ascertain the efficiency of this biomimetic nanolithography patterning and to identify common structural properties of endothelial cells, high content image analysis was performed on images of the four COIs and their respective patterned cells. This analysis is described in detail elsewhere [7]. Briefly, it provides a means to automatically obtain a suite of morphology metrics of single cells from images. 41 morphology metrics were applied to cells in this dataset. Metrics fell into six main categories: (1) contouring; (2) texture; (3) polarity; (4) adhesion sites; (5) intensity, area and shape; and (6) fiber alignment. K-means cluster analysis was then performed on the total population of 444 cells (each described by a set of 41 metrics) to determine how cytoskeletal and nuclei features of the patterned cells aligned with the COIs.   four-cluster ten-dimensional synthetic dataset using Hierarchical Clustering. The results are consistent with that using k-means (Figure 2 and 3).

AML Adhesion Pathway RPPA Dataset
Expression levels of 10 adhesion pathway proteins were obtained from 560 acute myeloid leukemia (AML) samples collected by researchers at the University of Texas MD Anderson Cancer Center from patient blood, marrow and plasma. The 10 adhesion-related proteins were chosen based on their hypothesized biological activity in leukemia and significance in determining patient survival outcome. Reverse phase protein array (RPPA) technology was employed to obtain these protein levels. In RPPA, an arrayer spotted whole cell lysates onto nitrocellulose slides in serial dilution, and each slide was probed with a highly validated antibody against a protein (total, phospho or cleaved) of interest [8]. The overall positive control lysate consisted of a pooled mixture of 11 AML cell lines, which were used to account for variations in staining, background and loading, while the lysate buffer served as the negative control. MicroVigene software was used to scan and digitize slides, and the data was normalized following established protocols [8,9].