Main

We want to be able to verify data integrity while tracking the provenance and versions of all data and results, as well as the software that has been run on them to generate derivative data sets. Different groups have different platforms, so compatibility, standards and documentation are essential to ensure integrity and to maximize data reuse. In addition, the complexity of the data poses challenges for visualizing them and for exploring them effectively to extract relevant knowledge.

Heat maps

Heat maps are a mainstay for visualizing clustered or ordered data matrices and identifying groups of related data points. Many tools produce heat maps, but most produce static images with a fixed sort order and no option to select which data are displayed, so they offer no way to explore the data interactively. Four tools that overcome these limitations are Gitools, interactive heat maps (PLoS One 6, e19541, 2011), Next-Generation Clustered Heat Maps (NG-CHM) and the UCSC Cancer Genomics Browser.

Gitools heat maps, whose columns and rows normally represent tumor samples and genes respectively, can represent multiple values in each cell, which makes them especially well suited for the representation of multidimensional cancer genomics data. Their interactive capabilities allow the user to filter, sort, move and hide rows and columns in the context of gene and tumor sample annotations and to launch several common exploratory analyses, such as correlations, clustering, enrichment and statistical comparisons between groups of samples. A built-in option allows users to sort the genes and samples within a heat map so that the pattern of mutual exclusivity of alterations becomes visible. TCGA Pan-Cancer data are ready to be browsed with Gitools interactive heat maps at http://www.gitools.org/datasets. The Gitools Pan-Cancer multidimensional matrix contains information for 4,678 samples and 22,046 genes, with 4 values per sample and gene, including expression, copy number and mutation data. In addition, numerous clinical and molecular annotations of samples and genes are available to be added to the visualization and used to filter, sort and analyze the data. The candidate cancer driver genes predicted using multiple methods, including MutSig, MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver, are also available (Sci. Rep. doi:10.1038/srep02650). All the results reported by IntOGen-mutations are also available to be browsed (Nat. Methods doi:10.1038/nmeth.2642).
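
The text does not say how Gitools implements its mutual-exclusivity sort. As a rough illustration only, the sketch below uses one common scheme for binary alteration matrices: genes are ordered by alteration frequency, then samples are ordered by their alteration pattern across those genes. The function name and the toy matrix are hypothetical and are not taken from Gitools.

```python
def mutual_exclusivity_sort(matrix, genes, samples):
    """Reorder a binary alteration matrix to expose mutual exclusivity.

    matrix[g][s] is 1 if gene g is altered in sample s, else 0.
    This is an assumed, generic scheme, not Gitools' own algorithm.
    """
    # Most frequently altered genes go to the top.
    gene_order = sorted(range(len(genes)), key=lambda g: -sum(matrix[g]))
    # Samples altered in the top gene come first, ties broken by the next gene, etc.
    sample_order = sorted(
        range(len(samples)),
        key=lambda s: tuple(-matrix[g][s] for g in gene_order),
    )
    sorted_genes = [genes[g] for g in gene_order]
    sorted_samples = [samples[s] for s in sample_order]
    sorted_matrix = [[matrix[g][s] for s in sample_order] for g in gene_order]
    return sorted_genes, sorted_samples, sorted_matrix


# Toy input (illustrative values only, not TCGA data).
genes = ["GENE_A", "GENE_B", "GENE_C"]
samples = ["s1", "s2", "s3", "s4", "s5"]
matrix = [
    [0, 1, 0, 1, 0],  # GENE_A
    [1, 0, 0, 0, 0],  # GENE_B
    [0, 0, 1, 1, 1],  # GENE_C
]
sorted_genes, sorted_samples, sorted_matrix = mutual_exclusivity_sort(
    matrix, genes, samples
)
for gene, row in zip(sorted_genes, sorted_matrix):
    print(gene, row)
```

After sorting, alterations that rarely co-occur form a staircase pattern from the upper left to the lower right of the matrix, which is the visual cue for mutual exclusivity.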

Digging into TCGA data with interactive graphics

The 1990s saw the introduction of clustered heat maps for omic biology (Science 275, 343–349, 1997). They have since become the most popular way to visualize patterns in molecular profiling data, for example, from microarrays and sequencing technologies. They have been included in all of the TCGA papers published in Nature so far on the cancers of specific tissue origin. Of necessity, however, they have been included in the pages of the journal as static images. We have now developed 'next-generation' clustered heat maps, which use a Google Maps–like tiling technology for extreme zooming and navigation without loss of resolution. Next-generation clustered heat maps provide pathway and Gene Ontology (GO) information, chromosomal interactive ideograms, recoloring on the fly, high-resolution graphics output and linkouts to public information resources (for example, to the cBio Portal) on genes, proteins, pathways and drugs. Perhaps most importantly, all of the metadata elements needed to reproduce them months or years later are captured and automatically saved. The result is a visually rich, dynamic environment for the exploration of the masses of data produced by TCGA. The Compendium of TCGA Pan-Cancer NG-CHMs (Figs. 1, 2a, 2b, 2c, 3, 4, 5, 6) as of August 2013 displayed 676 maps as an initial set, but the numbers will soon rise into the thousands as more data types, tumor types and algorithms are incorporated.

Figure 1: The Pan-Cancer NG-CHM portal.

In this screenshot, “BRCA – Breast invasive carcinoma” has been selected from the cancer type drop-down menu at the left to browse the set of large-icon heat maps for that disease. The profiling platform, the heat map type or any combination of the three criteria could have been chosen instead for searching. Clicking on any of the icons brings up a single interactive NG-CHM, which can be navigated and explored. The term 'Gene/Probe' represents a gene, transcript, protein or methylation probe, depending on the data type. Each NG-CHM includes images for three different variants of the data normalization algorithm; it is possible to toggle quickly between the three images to compare them. The compendium is available at http://bioinformatics.mdanderson.org/TCGA/NGCHMPortal/.

Figure 3: Figure 2A. A sample versus protein Pan-Cancer NG-CHM for RPPA protein expression.

A, The entire NG-CHM (R. Akbani and G. Mills, personal communication). At the top of the figure are the dendrogram and a series of status bars that indicate interesting molecular, pathological or clinical characteristics of the samples (e.g., mutation or amplification of a particular gene or histological type). A number of patterns of red (high expression) and blue (low expression) in the heat map would be interesting to explore in greater detail, but the image resolution is not high enough to do so. One can therefore zoom in on various interesting patterns in the map. The black, green and blue cartouches indicate areas of the map that were zoomed to (symmetrically or asymmetrically).

Figure 4: Figure 2B

Symmetrically zoomed view of the black cartouche region in 2A, which includes ER-α (ERALPHA). HER2 clustered elsewhere in the map (blue cartouche) but is shown here (as a separate bar) because it represents a novel story that relates to the ERALPHA data.

Figure 5: Figure 2C

Asymmetrically zoomed view of the green cartouche region in 2A. Asymmetric zooming of only the vertical axes (using the lock at the upper right in 2A) allows the full width of the map to be represented. See http://bioinformatics.mdanderson.org/TCGA/NGCHMPortal/.

Figure 6: Protein versus protein mRNA expression NG-CHM for the Pan-Cancer set.

(a) Entire NG-CHM. Protein names are suppressed because they would be too small to read. (b) Zoomed version of the lower right-hand corner, showing protein labels. (c) Movable, resizable windows: Image Details (top), Navigator (middle) and Classification Details (bottom). The Navigator window indicates the position of the view on the entire map and can also be clicked on to move the field of view. Each entry in the map expresses the relationship between protein expression patterns in terms of the Pearson correlation coefficient. Note the high positive correlation (red) between the expression of ER-α (ERALPHA) and the progesterone receptor (PR). See http://bioinformatics.mdanderson.org/TCGA/NGCHMPortal/.
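
As an illustration of how each cell of such a protein-versus-protein map could be computed, here is a minimal sketch of a pairwise Pearson correlation matrix. The protein names and expression values are hypothetical toy inputs, and the function names are not part of NG-CHM.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """Map each pair of proteins to the correlation of their expression profiles.

    profiles: dict of protein name -> list of expression values, one per sample.
    """
    names = list(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in names} for a in names}


# Hypothetical toy profiles across four samples (not real RPPA measurements).
profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.0],
    "P3": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(profiles)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In a heat map, values near +1 would be drawn at the red end of the color scale and values near -1 at the blue end.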

Figure 7: Entire sample versus gene NG-CHM for renal clear-cell (KIRC) mRNA.

At the top are status bars and menus for a large number of interactive map functions. To the right are movable, resizable Navigator, Image Detail and Category windows. The resolution is insufficient to see sample or gene labels; the zoomed view in Figure 5 shows the labels. See http://bioinformatics.mdanderson.org/main/NG-CHM:Overview.

Figure 8: Gene versus sample KIRC mRNA NG-CHM zoomed to show a red patch of high expression.

Gene and sample labels are visible here. Also shown are menus for navigation (lower blue dialog box) and linkouts (blue box at upper right) plus a window for changing the color scheme. See http://bioinformatics.mdanderson.org/main/NG-CHM:Overview.

Figure 9: Some additional features of NG-CHMs.

(a) Scattergram opened by clicking on a data point in a gene-versus-gene KIRC map. (b) Distribution of bootstrap P values for the same data point (i.e., for the Pearson correlation coefficient it represents). (c) GO enrichment statistics for a selected set of 218 genes. (d) Kaplan-Meier plot for patient samples selected by clicking on the dendrogram compared to those not selected. See http://bioinformatics.mdanderson.org/main/NG-CHM:Overview.

An example: visualizing the pan-cancer phosphoproteome

TCPA: a resource for cancer functional proteomics data

Jun Li et al. Nature Methods 10.1038/nmeth.2650

The Visualization module provides two ways to examine global protein expression patterns in a specific RPPA data set. One is through a “next-generation clustered heat map” (Fig. 1, v), which allows users to zoom, navigate and scrutinize clustering patterns of samples or proteins and link those patterns to relevant biological information sources. The other is through a network view (Fig. 1, vi), which overlays the correlation between any two interacting partners in the protein interaction network (curated in the Human Protein Reference Database, ref. 8).

Figure 2: Overview of the TCPA data portal.

TCPA contains six modules [including] the Visualization module, which has a “next-generation clustered heat map” view (v) and network view (vi).

Ordinating cancer information

Cancer can yield many types of alterations, and cancer genomics demands many forms of visualization to best assess them collectively.

Exploring TCGA pan-cancer data at the UCSC Cancer Genomics Browser

Melissa Cline et al. Scientific Reports 10.1038/srep02652

There are many choices for cancer genomics visualization tools, as reviewed (ref. 3). They fall into three general categories: genome-based, gene-based, and pathway-based. Genome-based methods such as the UCSC Cancer Genomics Browser and IGV (http://www.broadinstitute.org/igv/) are well suited for exploring alterations that follow genomic coordinates, such as copy number variations or DNA methylation profiles. However, because genome-based visualizations only allow the user to see one or two genomic regions at once, they are generally not as effective for exploring possible connections between alterations in multiple different genomic regions. Gene-based visualizations are offered by Gitools (http://www.gitools.org/) [as used in the visualization of data by Gonzalez-Perez et al., doi:10.1038/nmeth.2642 and Tamborero et al., doi:10.1038/srep02650], cBio Portal (http://www.cbioportal.org/public-portal/) [as used by Ciriello et al., doi:10.1038/ng.2762], IntOGen (http://www.intogen.org) [as described by Gonzalez-Perez et al., doi:10.1038/nmeth.2642 and used by Tamborero et al., doi:10.1038/srep02650], IGV, and the UCSC Cancer Genomics Browser. These tools are useful for exploring alterations in multiple genes at once, especially when the set of genes has known functional significance such as known marker genes or genes in a common pathway. The UCSC Cancer Genomics Browser further allows probe-based visualization in its gene-based viewing mode, which is particularly effective for assessing exon expression or DNA methylation data in genes of interest. However, these heatmap-based methods do not generally indicate how genes interact in a pathway. Pathway-based methods such as Cytoscape (http://cytoscape.org/) and cBio Portal allow the user to see known functional connections between genes, but are limited to known pathways (which are limited in coverage) or predicted pathways (which tend to have high error rates). In short, no tool meets all purposes, and all tools have scenarios in which they are most effective.

The UCSC Cancer Genomics Browser presents the TCGA data and Pan-Cancer subtypes in a coherent, integrated system for both TCGA researchers and the scientific community at large. It provides direct access to and visualization of data at specific genes or genomic regions on samples of interest. It displays genomic aberrations alongside clinical and annotation features in a flexible, dynamic display, sorted by the features of the user's selection. Integrated Kaplan-Meier plots allow researchers to assess whether differences in cancer subtype coincide with differences in survival.

The power of this approach is illustrated in Figure 1, which shows a heatmap of the somatic mutation profile of the significantly mutated genes in the TCGA acute myeloid leukemia (AML) cohort, as well as the corresponding AML subtype designations for these samples.

Figure 10: Using the UCSC Cancer Genomics Browser to explore relationships between somatic mutation profiles, genomic subtypes and survival.

(a) Somatic mutations for the most significantly mutated genes in TCGA AML tumor samples (ref. 3). Samples are arranged in rows and genes in columns. Red indicates that the tumor sample harbors non-synonymous coding mutations in the corresponding gene, while white indicates that such mutations were not detected. (b) Column 1 represents the miRNA expression clusters (ref. 3), Column 2 represents the DNA methylation clusters [H. Shen, personal communication], and Column 3 represents cytogenetic risk category for the AML cohort. For each column, each cluster or category was assigned a distinct color from the D3 color map (https://github.com/mbostock/d3/wiki/Ordinal-Scales), with five clusters for miRNA expression (clusters 1-5), nine for DNA methylation (clusters 1-9) and three for cytogenetic risk category (favorable, intermediate, poor). A strong concordance is observed between miRNA cluster 3 (orange), DNA methylation cluster 3 (also orange) and intermediate cytogenetic risk (light blue); and between miRNA cluster 5 (green), DNA methylation cluster 5 (also green) and favorable cytogenetic risk (dark blue). (c) The integrated Kaplan-Meier plot confirms that miRNA cluster 5 (green line) has a more favorable overall survival profile. The colors of the lines correspond to the colors of the miRNA clusters. See https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/#?bookmark=sr1.
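
For readers unfamiliar with how a Kaplan-Meier curve is derived, the following is a minimal sketch of the standard product-limit estimator. It is not the browser's own code, and the follow-up times and event flags are hypothetical toy values, not TCGA AML data.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times:  follow-up duration for each patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and event flags for one subtype.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t)={s:.2f}")
```

Running the estimator once per cluster and overlaying the resulting step curves gives the kind of per-subtype comparison shown in panel (c).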

Aggregation and standardization of data and results

Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas

Larsson Omberg et al. Nature Genetics 10.1038/ng.2761

To coordinate all of the investigators working on the same data, standardized collections of data sets were released in the form of 'data freezes', which served as the input for all downstream analysis. Files in the data freeze were intuitively presented to researchers as lists of tab-delimited matrices for each tumor type and experimental platform. As described below, each file in a data freeze was associated with provenance tracking, data versioning, queryable structured meta-data and bindings to multiple analytical clients.
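
As a rough sketch of what working with one such tab-delimited matrix might look like, the snippet below parses a small in-memory example. The layout (features in rows, samples in columns, "NA" for missing values) and all identifiers are assumptions made for illustration; the actual data-freeze files may be organized differently.

```python
import csv
import io


def read_matrix(handle):
    """Parse a tab-delimited matrix into (column_names, {row_name: values}).

    Assumes the first row is a header and the first column holds row names.
    Empty cells and "NA" are returned as None.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    columns = header[1:]
    rows = {}
    for record in reader:
        if not record:
            continue  # skip blank lines
        rows[record[0]] = [
            None if value in ("", "NA") else float(value) for value in record[1:]
        ]
    return columns, rows


# Hypothetical file content standing in for one data-freeze matrix.
example = "gene\tSAMPLE_1\tSAMPLE_2\nGENE_A\t1.5\tNA\nGENE_B\t0\t2\n"
columns, rows = read_matrix(io.StringIO(example))
print(columns)
print(rows)
```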

Each processed data set was associated with a provenance record, depicted as a graph of the input data sets and data-processing procedures used to generate the data (Fig. 3).

Figure 11: Example provenance graph of a multistep workflow showing interaction between the analysis of three researchers.

The provenance record consists of two types of nodes—activities (shown as red boxes above) performed by a researcher and input and output files of these actions (shown as file and folder icons and identified by their name and Synapse ID). In addition, every activity has metadata associated with it to further describe the details of the actions performed. This specific graph shows the workflow used to perform comparative analysis of two mutation-calling algorithms—MuSiC and MutSig. For MuSiC, the provenance of analysis is displayed from input data to derivation of mutation calls. Provenance records may be further expanded (ellipses) to trace the origin of input files to their original data source in Firehose, DCC or personal communications with AWG members. For brevity, the MutSig graph is not expanded. This graph was produced from version 2 of the data in doi:10.7303/syn1750331.2.
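
To make the two-node-type structure concrete, here is a minimal sketch of a provenance graph and a traversal that traces a file back to everything it was derived from. The class, function and entity identifiers are hypothetical; this does not use the Synapse client or real Synapse IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    """One processing step: it uses input entities and generates outputs."""
    name: str
    used: list          # IDs of input files or folders
    generated: list     # IDs of output files or folders
    metadata: dict = field(default_factory=dict)  # free-form details of the step


def trace_inputs(entity_id, activities):
    """Return the set of all upstream entity IDs that entity_id derives from."""
    produced_by = {out: act for act in activities for out in act.generated}
    seen = set()
    stack = [entity_id]
    while stack:
        current = stack.pop()
        activity = produced_by.get(current)
        if activity is None:
            continue  # an original source with no recorded producing activity
        for parent in activity.used:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical two-step workflow: filter raw mutations, then call drivers.
workflow = [
    Activity("filter mutations", used=["raw_mutations"], generated=["filtered"],
             metadata={"executed_by": "researcher_1"}),
    Activity("call significant genes", used=["filtered", "parameters"],
             generated=["gene_calls"], metadata={"executed_by": "researcher_2"}),
]
print(sorted(trace_inputs("gene_calls", workflow)))
```

Storing the producing activity for each output is what allows a graph like the one in the figure to be expanded step by step back to the original data source.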

In addition to creating capabilities for describing and sharing analysis workflows, the Pan-Cancer group also explored a research model in which independent groups of investigators collaboratively evolved novel analytical methods through the use of automated tools to assess the performance of each approach (ref. 14). Specifically, the Pan-Cancer group used tools to provide real-time automated assessments, based on common performance metrics, of both 'unsupervised' clustering methods (J.S. et al., unpublished data) and 'supervised' molecular prognostic models of patient survival (Y.Y., E.M. Van Allen, L.O., N. Wagle, A. Sokolov et al., unpublished data).

The data freezes, analysis results and evaluation framework for survival predictions each correspond to a new publicly available resource released in conjunction with this work. First, the curated Pan-Cancer data freezes are now available (syn300013), allowing researchers to easily access well-curated, analysis-ready data sets from the TCGA Research Network. Data freezes will continue to be maintained and updated in future expansions of the Pan-Cancer project. Updates will be immediately available to the community, allowing any researcher to use data from and contribute to the Pan-Cancer project. Second, we are releasing a resource of Pan-Cancer analysis results (syn1895888), containing the results of applying the most commonly used algorithms developed throughout the course of the broader TCGA effort.

Visualization and interpretation of cancer genome analysis

IntOGen-mutations identifies cancer drivers across tumor types

Abel Gonzalez-Perez et al. Nature Methods 10.1038/nmeth.2642

The IntOGen-mutations platform (http://www.intogen.org/mutations/) summarizes somatic mutations, genes and pathways involved in tumorigenesis. It identifies and visualizes cancer drivers, analyzing 4,623 exomes from 13 cancer sites. The different modules in the pipeline are executed by a workflow management system (Wok, https://bitbucket.org/bbglab/wok/).

The results of the pipeline are automatically loaded into a Web browser managed by the Onexus framework (Supplementary Fig. 1). IntOGen-mutations will be regularly updated with new cancer genome resequencing data. The results can be browsed through the Web (Supplementary Note 2) and with Gitools interactive heat maps (ref. 15; http://www.gitools.org/datasets/). The pipeline may be downloaded and can also be run online on our servers. It can be used to identify drivers from newly sequenced cohorts of tumor samples (Supplementary Note 3) and to interpret the mutations observed in a tumor sample (Supplementary Note 4).

Figure 12: Supplementary Figure 1: Diagram representing three different use cases of the IntOGen-mutations resource.

A) The IntOGen-mutations pipeline is used to compute the results of periodically obtained public data and populate the IntOGen web discovery tool. Users can browse the results of the pipeline through the IntOGen-mutations public browser, searching on gene, pathway, cancer type or tumor sequencing project. B) Users employ the IntOGen-mutations pipeline, either through the web server or locally to analyze their own data, thus obtaining a private browser with their results. In the second use case, users input a list of somatic mutations from a cohort of tumor samples to identify putative driver gene mutations. The results can be browsed within the context of accumulated knowledge in IntOGen. C) The pipeline can also be used to rank somatic mutations identified in the tumor of an individual to assess their potential implication in cancer development.
