Collaborative analysis of multiple large orthogonal data sets presents many challenges. Here are some of the solutions created and used in the Pan-Cancer analysis initiative.
We want to be able to verify data integrity while tracking the provenance and versions of all data and results, as well as the software that has been run on them to generate derivative data sets. Different groups use different platforms, so compatibility, standards and documentation are essential to ensure integrity and to maximize data reuse. In addition, the complexity of the data makes it difficult to visualize and explore them effectively enough to extract relevant knowledge.
Heat maps are a mainstay for visualizing clustered or ordered data matrices and identifying groups of related data points. Many tools produce heat maps, but most produce static images with a fixed sort order and no option to select which data are displayed, so they offer no way to explore the data interactively. Three tools that overcome these limitations are Gitools interactive heat maps (PLoS One 6, e19541, 2011), Next-Generation Clustered Heat Maps (NG-CHM) and the UCSC Cancer Genomics Browser.
Gitools heat maps, whose columns and rows normally represent tumor samples and genes, respectively, can represent multiple values in each cell, which makes them especially well suited to multidimensional cancer genomics data. Their interactive capabilities allow the user to filter, sort, move and hide rows and columns in the context of gene and tumor sample annotations and to launch several common exploratory analyses, such as correlations, clustering, enrichment and statistical comparisons between groups of samples. A built-in option allows users to sort the genes and samples within a heat map so that the pattern of mutual exclusivity of alterations becomes visible. TCGA Pan-Cancer data are ready to be browsed with Gitools interactive heat maps at http://www.gitools.org/datasets. The Gitools Pan-Cancer multidimensional matrix contains information for 4,678 samples and 22,046 genes, with 4 values per sample and gene, including expression, copy number and mutation data. Numerous clinical and molecular annotations of samples and genes can be added to the visualization and used to filter, sort and analyze the data. The candidate cancer driver genes predicted using multiple methods, including MutSig, MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver, are also available (Sci. Rep. doi: 10.1038/srep02650), as are all the results reported by IntOGen-mutations (Nat. Methods doi:10.1038/nmeth.2642).
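As an illustration of what sorting by mutual exclusivity can look like, the following minimal sketch orders a binary genes-by-samples alteration matrix. It is not the Gitools implementation, whose details are not given here. It assumes a common heuristic: genes are ranked by alteration frequency, and samples are then sorted lexicographically so that samples altered in the top-ranked gene come first. The function name `mutual_exclusivity_order` and the toy matrix are hypothetical.

```python
import numpy as np

def mutual_exclusivity_order(altered):
    """Return (row_order, col_order) for a binary genes x samples matrix.

    Assumed heuristic, not the Gitools algorithm: rank genes by how many
    samples they are altered in, then sort samples lexicographically.
    """
    altered = np.asarray(altered, dtype=bool)
    # Genes: most frequently altered first; a stable sort keeps ties in input order.
    row_order = np.argsort(-altered.sum(axis=1), kind="stable")
    sorted_rows = altered[row_order]
    # Samples: np.lexsort treats the LAST key as the primary key, so the rows are
    # reversed to make the top-ranked gene primary. Inverting the booleans makes
    # altered samples (0 after inversion) sort ahead of unaltered ones.
    keys = (~sorted_rows[::-1]).astype(int)
    col_order = np.lexsort(keys)
    return row_order, col_order

# Toy example: 3 genes x 6 samples, 1 = altered (values are made up).
matrix = np.array([
    [0, 0, 1, 0, 0, 0],  # gene A, altered in 1 sample
    [0, 1, 0, 1, 1, 0],  # gene B, altered in 3 samples
    [1, 0, 0, 0, 0, 1],  # gene C, altered in 2 samples
])
rows, cols = mutual_exclusivity_order(matrix)
sorted_matrix = matrix[rows][:, cols]
print(sorted_matrix)
```

In the reordered matrix the alterations form a staircase running from the top left to the bottom right, which is the visual signature of mutually exclusive alterations.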
Digging into TCGA data with interactive graphics
The 1990s saw the introduction of clustered heat maps for omic biology (Science 275, 343–349, 1997). They have since become the most popular way to visualize patterns in molecular profiling data, for example, from microarrays and sequencing technologies. They have been included in all of the TCGA papers published in Nature so far on the cancers of specific tissue origin. Of necessity, however, they have been included in the pages of the journal as static images. We have now developed 'next-generation' clustered heat maps, which use a Google Maps–like tiling technology for extreme zooming and navigation without loss of resolution. Next-generation clustered heat maps provide pathway and Gene Ontology (GO) information, chromosomal interactive ideograms, recoloring on the fly, high-resolution graphics output and linkouts to public information resources (for example, to the cBio Portal) on genes, proteins, pathways and drugs. Perhaps most importantly, all of the metadata elements needed to reproduce them months or years later are captured and automatically saved. The result is a visually rich, dynamic environment for the exploration of the masses of data produced by TCGA. The Compendium of TCGA Pan-Cancer NG-CHMs (Figs. 1, 2a–c and 3–6) as of August 2013 displayed 676 maps as an initial set, but the numbers will soon rise into the thousands as more data types, tumor types and algorithms are incorporated.
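To make the idea of map-style tiling concrete, the sketch below computes a generic tile pyramid for a large heat map matrix. The actual NG-CHM tiling scheme is not described here, so everything in the sketch is an assumption: the tile size of 256 cells, the halving of resolution at each level, the matrix dimensions and the function names `tile_pyramid` and `tile_for_cell` are all hypothetical.

```python
import math

def tile_pyramid(n_rows, n_cols, tile_size=256):
    """List (level, rows, cols, tiles_down, tiles_across) per zoom level.

    Level 0 is full resolution. Each further level halves both dimensions
    (rounding up) until the whole matrix fits in a single tile.
    """
    levels = []
    level, rows, cols = 0, n_rows, n_cols
    while True:
        levels.append((level, rows, cols,
                       math.ceil(rows / tile_size),
                       math.ceil(cols / tile_size)))
        if rows <= tile_size and cols <= tile_size:
            break
        rows, cols = math.ceil(rows / 2), math.ceil(cols / 2)
        level += 1
    return levels

def tile_for_cell(row, col, level, tile_size=256):
    """Return the (tile_row, tile_col) holding a full-resolution cell at a level."""
    # Right-shifting by `level` maps the cell to its downsampled coordinates.
    return (row >> level) // tile_size, (col >> level) // tile_size

# Hypothetical matrix of 20,000 rows by 5,000 columns.
pyramid = tile_pyramid(20000, 5000)
for level, rows, cols, down, across in pyramid:
    print(f"level {level}: {rows} x {cols} cells, {down} x {across} tiles")
```

A viewer built this way only fetches the tiles that intersect the current viewport at the current zoom level, which is why zooming and panning stay responsive regardless of the size of the full matrix.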
An example: visualizing the pan-cancer phosphoproteome
Ordinating cancer information
Cancer can yield many types of alterations, and cancer genomics demands many forms of visualization to best assess them collectively.