Points of Significance: Clustering

Journal name: Nature Methods
Volume: 14
Pages: 545–546
DOI: 10.1038/nmeth.4299

Clustering finds patterns in data—whether they are there or not.

Figures

  1. Similarity measures between expression profiles across n = 15 patients (dots) of five putative genes (blue) and a reference (gray).

    (a) Absolute expression profiles of genes A–E generated by various transformations from the reference. Their similarity to the reference is shown as the Euclidean distance expressed as root mean square (r.m.s.). Gene C is most similar to the reference (r.m.s. = 0.76), followed by gene B (r.m.s. = 1.52). (b) Profiles from panel a centered on their means, with the corresponding r.m.s. Gene A and reference profiles now overlap (r.m.s. = 0), and the similarity of gene E to the reference has decreased to be the same as that of gene C (r.m.s. = 0.76). (c) Profiles from panel a transformed into z-scores. Gene B has no profile because the z-score is undefined when no variation is present.
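
The three similarity measures in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the profiles `reference`, `shifted`, `scaled`, and `flat` are made-up toy values rather than the data in Figure 1, and the use of the sample standard deviation for z-scores is an assumption.

```python
import math
from statistics import mean, stdev

def rms_distance(x, y):
    """Euclidean distance between two profiles, expressed as root mean square."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def center(x):
    """Subtract the profile mean so that only the shape of the profile remains."""
    m = mean(x)
    return [v - m for v in x]

def zscores(x):
    """Center and scale to unit s.d.; return None when the profile has no variation."""
    s = stdev(x)  # sample s.d. (assumption; the caption does not specify)
    if s == 0:
        return None
    m = mean(x)
    return [(v - m) / s for v in x]

# Hypothetical toy profiles (not the values plotted in the figure).
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [v + 2.0 for v in reference]  # same shape, constant offset
scaled = [2.0 * v for v in reference]   # same shape, larger amplitude
flat = [2.0] * 5                        # no variation across patients

print(rms_distance(reference, shifted))                    # offset counts as distance
print(rms_distance(center(reference), center(shifted)))    # offset removed by centering
print(rms_distance(zscores(reference), zscores(scaled)))   # amplitude removed by z-scores
print(zscores(flat))                                       # undefined, reported as None
```

In this sketch a constant offset disappears after centering, an amplitude difference disappears after z-scoring, and a flat profile has no z-score, which mirrors the behaviour the caption describes for genes A and B.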

  2. Complete linkage clustering of five objects.

    (a) Pairwise distances (step 1) are used to merge objects (steps 2–4); the distance between two clusters is taken as the maximum of all pairwise distances between their members. At each merging step, the pair with the shortest distance is merged (blue). (b) A dendrogram with a vertical axis showing the distance between merged nodes. To create clusters, one can cut the tree at a fixed height (dashed line).
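
The merging procedure in this caption can be written out as a short sketch. The function names `complete_linkage` and `cut_tree` and the distance matrix `dist` are hypothetical and chosen for illustration; the matrix does not reproduce the distances in Figure 2.

```python
from itertools import combinations

def complete_linkage(dist, labels):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, height)."""
    index = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]

    def linkage(c1, c2):
        # Complete linkage: the largest pairwise distance between members.
        return max(dist[index[a]][index[b]] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the shortest linkage distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

def cut_tree(merges, labels, height):
    """Clusters obtained by keeping only merges at or below the cut height."""
    clusters = {(label,) for label in labels}
    for c1, c2, h in merges:
        if h <= height:
            clusters -= {c1, c2}
            clusters.add(c1 + c2)
    return sorted(clusters)

# Hypothetical symmetric distance matrix for five objects A-E.
labels = ["A", "B", "C", "D", "E"]
dist = [
    [0, 2, 6, 10, 9],
    [2, 0, 5, 9, 8],
    [6, 5, 0, 4, 5],
    [10, 9, 4, 0, 3],
    [9, 8, 5, 3, 0],
]

merges = complete_linkage(dist, labels)
for c1, c2, height in merges:
    print(c1, c2, height)
print(cut_tree(merges, labels, height=4))
```

With this toy matrix the merge heights are 2, 3, 5, and 10, and cutting at height 4 leaves three clusters: (A, B), (C), and (D, E).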

  3. Dendrograms of hierarchical clustering of gene expression profiles based on correlation distance.

    The data were generated by creating core profiles A1, B1, C1, D1, and E1 with correlation values of 0.7, 0.5, 0, −0.5, and −0.7 (respectively) with the reference profile R from Figure 1. For each core profile (e.g., A1), four additional highly correlated random profiles were generated (e.g., A2–A5). Profiles are colored by group, and clusters are formed by cutting the tree at a fixed height (dashed line). (a) Complete linkage clustering tends to create balanced dendrograms by first clustering objects into small nodes and then clustering the nodes. (b) Single linkage clustering tends to create stringy dendrograms by first creating a few nodes and then adding objects to them one at a time.
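
As a rough sketch of the quantities involved, the code below computes a correlation distance and compares how complete and single linkage measure the distance between two groups of profiles. The definition of correlation distance as 1 − r is a common convention assumed here, since the caption gives no formula, and the profiles and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    """Assumed definition: 0 for perfect correlation, 2 for perfect anticorrelation."""
    return 1.0 - pearson(x, y)

def group_distance(group_a, group_b, linkage):
    """Distance between two groups; linkage=max is complete, linkage=min is single."""
    return linkage(correlation_distance(a, b) for a in group_a for b in group_b)

# Hypothetical toy profiles.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = [10.0, 20.0, 30.0, 40.0, 50.0]  # perfectly correlated with reference
reverse = [5.0, 4.0, 3.0, 2.0, 1.0]      # perfectly anticorrelated with reference

print(group_distance([reference], [scaled, reverse], linkage=max))  # complete linkage
print(group_distance([reference], [scaled, reverse], linkage=min))  # single linkage
```

Single linkage needs only one close pair to join two groups, which is one way to see why it tends to attach objects to existing nodes one at a time, whereas complete linkage requires every pair to be close.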

  4. Simulation of 10,000 trials of k-means clustering with k = 3 of 35 points (black), of which 20, 10, and 5 were centered on each of the gray circles, respectively, and spatially distributed normally within the circle with s.d. half of the circle radius.

    Centroids are indicated by colored hollow points; initial centroids were randomly selected points from the data set. (a) Evolution of a trial that results in the lowest total within-cluster distance, d = 38.4. With each iteration, d generally drops. Points are shown connected to and colored by their assigned centroid. (b) Histogram of the total within-cluster distance for 10,000 trials. The lowest d = 38.4 solution (a) was found in 1,236 (12%) of trials. Bar labels indicate figure panels in which the solution is shown. (c,d) Two most common solutions, their d and frequency observed. (e,f) Examples of solutions whose clusters do not follow the original grouping of points. (g) Solution with largest d.
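
A simulation of this kind can be approximated with the sketch below. It makes several assumptions that the caption does not confirm: a plain Lloyd-style iteration is used, d is computed as the sum of Euclidean distances from each point to its assigned centroid, and the circle centres, the radius of 1 (so s.d. = 0.5), the seed, and the reduced number of trials are all hypothetical choices.

```python
import math
import random

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, rng, max_iter=100):
    """Lloyd-style k-means; initial centroids are randomly selected data points."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    labels = assign(points, centroids)
    # Total within-cluster distance (assumed: sum of point-to-centroid distances).
    d = sum(math.dist(p, centroids[label]) for p, label in zip(points, labels))
    return d, labels, centroids

def simulate(n_trials=200, seed=0):
    """35 points in groups of 20, 10, and 5 around hypothetical centres, s.d. 0.5."""
    rng = random.Random(seed)
    groups = [((0.0, 0.0), 20), ((4.0, 0.0), 10), ((2.0, 3.0), 5)]
    points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
              for (cx, cy), n in groups for _ in range(n)]
    return points, [kmeans(points, 3, rng) for _ in range(n_trials)]

points, results = simulate()
best = min(results, key=lambda r: r[0])
hits = sum(1 for d, _, _ in results if abs(d - best[0]) < 1e-9)
print(f"lowest d = {best[0]:.2f}, found in {hits} of {len(results)} trials")
```

Because the outcome depends on the random initial centroids, different trials can end at different solutions with different d, which is why the simulation is repeated many times and the lowest-d solution is retained.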

Author information

Affiliations

  1. Naomi Altman is a Professor of Statistics at The Pennsylvania State University.

  2. Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.

Competing financial interests

The authors declare no competing financial interests.
