Thread 4: Data discovery, transparency and visualization

Lopez-Bigas, Nuria; Cline, Melissa; Broom, Bradley; Margolin, Adam; Omberg, Larsson; Weinstein, John; Axton, Myles

doi:10.1038/ng.2789

Thread
Open access
Published: 17 October 2013

Nuria Lopez-Bigas¹, Melissa Cline², Bradley Broom³, Adam Margolin⁴, Larsson Omberg⁴, John Weinstein⁵ & Myles Axton⁶

Nature Genetics, Article number: ng.2789 (2013)

Abstract

Collaborative analysis of multiple large orthogonal data sets presents many challenges. Here are some of the solutions created and used in the Pan-Cancer analysis initiative.

Main

We want to be able to verify data integrity while tracking the provenance and versions of all data and results—as well as the software that has been run on them to generate derivative data sets. Different groups have different platforms, so compatibility, standards and documentation are essential to ensure integrity and to maximize data reuse. In addition, the complexity of the data poses challenges in visualization and how to explore them effectively to extract relevant knowledge.

Heat maps

Heat maps are a mainstay for visualizing clustered or ordered data matrices and identifying groups of related data points. Many tools produce heat maps, but most produce static images with a fixed sort order and no option to select which data are displayed, providing no capability to explore the data interactively. Three tools that overcome these limitations are Gitools interactive heat maps (PLoS One 6, e19541, 2011), Next-Generation Clustered Heat Maps (NG-CHM) and the UCSC Cancer Genomics Browser.

Gitools heat maps, whose columns and rows normally represent tumor samples and genes respectively, can represent multiple values in each cell, which makes them especially well suited for the representation of multidimensional cancer genomics data. Their interactive capabilities allow the user to filter, sort, move and hide rows and columns in the context of gene and tumor sample annotations and to launch several common exploratory analyses, such as correlations, clustering, enrichment and statistical comparisons between groups of samples. A built-in option allows users to sort the genes and samples within a heat map so that they follow the pattern of mutual exclusivity of alterations. TCGA Pan-Cancer data are ready to be browsed with Gitools interactive heat maps at http://www.gitools.org/datasets. The Gitools Pan-Cancer multidimensional matrix contains information for 4,678 samples and 22,046 genes, with 4 values per sample and gene, including expression, copy number and mutation data. In addition, numerous clinical and molecular annotations of samples and genes are available to be added in the visualization and used to filter, sort and analyze the data. The candidate cancer driver genes predicted using multiple methods, including MutSig, MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver, are also available (Sci. Rep. doi:10.1038/srep02650). In addition, all the results reported by IntOGen-mutations are available to be browsed (Nat. Methods doi:10.1038/nmeth.2642).
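To make the idea of sorting by mutual exclusivity concrete, here is a minimal sketch of one common way to obtain a "staircase" layout from a binary alteration matrix. It is not the Gitools implementation, whose algorithm is not described here; the function name `sort_mutual_exclusive` and the toy matrix are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def sort_mutual_exclusive(
    matrix: Dict[str, Dict[str, int]], samples: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (gene order, sample order) that yields a staircase pattern.

    matrix[gene][sample] is 1 if the gene is altered in that sample, else 0.
    This is a generic heuristic, assumed for illustration only.
    """
    # Genes altered in the most samples come first (ties broken by name).
    genes = sorted(
        matrix,
        key=lambda g: (-sum(matrix[g].get(s, 0) for s in samples), g),
    )

    # Describe each sample by its alterations in the new gene order.
    def pattern(sample: str) -> Tuple[int, ...]:
        return tuple(matrix[g].get(sample, 0) for g in genes)

    # Negating the values sorts altered (1) before unaltered (0), so samples
    # are grouped by the first gene in which they carry an alteration.
    ordered = sorted(samples, key=lambda s: (tuple(-v for v in pattern(s)), s))
    return genes, ordered


# Toy data: gene and sample labels are placeholders, not Pan-Cancer results.
toy = {
    "TP53": {"s1": 1, "s2": 0, "s3": 1, "s4": 0, "s5": 0},
    "KRAS": {"s1": 0, "s2": 1, "s3": 0, "s4": 0, "s5": 0},
    "PTEN": {"s1": 0, "s2": 0, "s3": 0, "s4": 1, "s5": 0},
}
genes, samples = sort_mutual_exclusive(toy, ["s1", "s2", "s3", "s4", "s5"])
print(genes)    # ['TP53', 'KRAS', 'PTEN']
print(samples)  # ['s1', 's3', 's2', 's4', 's5']
```

With this ordering, alterations that rarely co-occur line up as a diagonal band, which is the visual cue usually read as mutual exclusivity.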

Digging into TCGA data with interactive graphics

The 1990s saw the introduction of clustered heat maps for omic biology (Science 275, 343–349, 1997). They have since become the most popular way to visualize patterns in molecular profiling data, for example, from microarrays and sequencing technologies. They have been included in all of the TCGA papers published in Nature so far on the cancers of specific tissue origin. Of necessity, however, they have been included in the pages of the journal as static images. We have now developed 'next-generation' clustered heat maps, which use a Google Maps–like tiling technology for extreme zooming and navigation without loss of resolution. Next-generation clustered heat maps provide pathway and Gene Ontology (GO) information, chromosomal interactive ideograms, recoloring on the fly, high-resolution graphics output and linkouts to public information resources (for example, to the cBio Portal) on genes, proteins, pathways and drugs. Perhaps most importantly, all of the metadata elements needed to reproduce them months or years later are captured and automatically saved. The result is a visually rich, dynamic environment for the exploration of the masses of data produced by TCGA. The Compendium of TCGA Pan-Cancer NG-CHMs (Figs. 1, 2a–c and 3–6) as of August 2013 displayed 676 maps as an initial set, but the numbers will soon rise into the thousands as more data types, tumor types and algorithms are incorporated.
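As a rough illustration of why map-style tiling allows deep zooming, the sketch below counts the tiles needed at each level of a resolution pyramid for a heat map of a given size. The 256-cell tile size, the halving between levels and the function name `tile_pyramid` are assumptions chosen for illustration; the actual NG-CHM tiling scheme is not specified in this text.

```python
from math import ceil
from typing import List, Tuple


def tile_pyramid(n_rows: int, n_cols: int, tile: int = 256) -> List[Tuple[int, int, int, int]]:
    """Return (level, rows, cols, n_tiles) from full resolution down to one tile.

    Level 0 is the full-resolution matrix here; each further level halves
    both dimensions (an assumed convention, not taken from NG-CHM).
    """
    levels = []
    level, rows, cols = 0, n_rows, n_cols
    while True:
        n_tiles = ceil(rows / tile) * ceil(cols / tile)
        levels.append((level, rows, cols, n_tiles))
        if n_tiles == 1:  # the whole map now fits in a single tile
            break
        rows, cols = ceil(rows / 2), ceil(cols / 2)
        level += 1
    return levels


pyramid = tile_pyramid(1000, 600)
for entry in pyramid:
    print(entry)
# (0, 1000, 600, 12)
# (1, 500, 300, 4)
# (2, 250, 150, 1)
```

A viewer built this way only has to fetch the few tiles covering the visible region at the current level, so the cost of panning and zooming stays roughly constant however large the underlying matrix is.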

**Figure 1: The Pan-Cancer NG-CHM portal.**

**Figure 3: Figure 2A. A sample versus protein Pan-Cancer NG-CHM for RPPA protein expression.**

**Figure 6: Protein versus protein mRNA expression NG-CHM for the Pan-Cancer set.**

**Figure 7: Entire sample versus gene NG-CHM for renal clear-cell (KIRC) mRNA.**

**Figure 8: Gene versus sample KIRC mRNA NG-CHM zoomed to show a red patch of high expression.**

**Figure 9: Some additional features of NG-CHMs.**

An example: visualizing the pan-cancer phosphoproteome

TCPA: a resource for cancer functional proteomics data

Jun Li et al. Nature Methods 10.1038/nmeth.2650

The Visualization module provides two ways to examine global protein expression patterns in a specific RPPA data set. One is through a “next-generation clustered heat map” (Fig. 1, v), which allows users to zoom, navigate and scrutinize clustering patterns of samples or proteins and link those patterns to relevant biological information sources. The other is through a network view (Fig. 1, vi), which overlays the correlation between any two interacting partners in the protein interaction network (curated in the Human Protein Reference Database⁸).
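The network view described above can be pictured as attaching a correlation value to each edge of an interaction network. The following is a minimal sketch under stated assumptions: Pearson correlation is assumed (the text says only "correlation"), and the protein labels, expression values and edge list are invented placeholders rather than TCPA data.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def edge_correlations(
    expr: Dict[str, List[float]], edges: List[Tuple[str, str]]
) -> Dict[Tuple[str, str], float]:
    """Annotate each interaction edge with the correlation of its two endpoints.

    Edges whose endpoints lack expression data are skipped.
    """
    return {
        (a, b): pearson(expr[a], expr[b])
        for a, b in edges
        if a in expr and b in expr
    }


# Placeholder profiles across four samples; "D" has no measurements.
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
edges = [("A", "B"), ("A", "C"), ("A", "D")]
overlay = edge_correlations(expr, edges)
for edge, r in overlay.items():
    print(edge, round(r, 3))
```

The resulting dictionary could then drive edge color or width in whatever network viewer is used.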

**Figure 2: Overview of the TCPA data portal.**

Ordinating cancer information

Cancer can yield many types of alterations, and cancer genomics demands many forms of visualization to best assess them collectively.

Exploring TCGA pan-cancer data at the UCSC Cancer Genomics Browser

Melissa Cline et al. Scientific Reports 10.1038/srep02652

There are many choices for cancer genomics visualization tools, as reviewed³. They fall into three general categories: genome-based, gene-based and pathway-based. Genome-based methods such as the UCSC Cancer Genomics Browser and IGV (http://www.broadinstitute.org/igv/) are well suited for exploring alterations that follow genomic coordinates, such as copy number variations or DNA methylation profiles. However, because genome-based visualizations only allow the user to see one or two genomic regions at once, they are generally not as effective for exploring possible connections between alterations in multiple different genomic regions. Gene-based visualizations are offered by Gitools (http://www.gitools.org/) [as used in the visualization of data by Gonzalez-Perez et al., doi:10.1038/nmeth.2642 and Tamborero et al., doi:10.1038/srep02650], cBio Portal (http://www.cbioportal.org/public-portal/) [as used by Ciriello et al., doi:10.1038/ng.2762], IntOGen (http://www.intogen.org) [as described by Gonzalez-Perez et al., doi:10.1038/nmeth.2642 and used by Tamborero et al., doi:10.1038/srep02650], IGV and the UCSC Cancer Genomics Browser. These tools are useful for exploring alterations in multiple genes at once, especially when the set of genes has known functional significance, such as known marker genes or genes in a common pathway. The UCSC Cancer Genomics Browser further allows probe-based visualization in its gene-based viewing mode, which is particularly effective for assessing exon expression or DNA methylation data in genes of interest. However, these heatmap-based methods do not generally indicate how genes interact in a pathway. Pathway-based methods such as Cytoscape (http://cytoscape.org/) and cBio Portal allow the user to see known functional connections between genes, but are limited to known pathways (which are limited in coverage) or predicted pathways (which tend to have high error rates). In short, no tool meets all purposes, and all tools have scenarios in which they are most effective.

The UCSC Cancer Genomics Browser presents the TCGA data and Pan-Cancer subtypes in a coherent, integrated system for both TCGA researchers and the scientific community at large. It provides direct access to and visualization of data at specific genes or genomic regions on samples of interest. It displays genomic aberrations alongside clinical and annotation features in a flexible, dynamic display, sorted by the features of the user's selection. Integrated Kaplan-Meier plots allow researchers to assess whether differences in cancer subtype coincide with differences in survival.
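For readers unfamiliar with what a Kaplan-Meier plot computes, the sketch below gives the standard product-limit estimator and applies it separately to each subtype. It is a generic illustration, not the browser's code; the function names and the toy follow-up times are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time with at least one event.

    events[i] is 1 if the event (for example, death) was observed at times[i]
    and 0 if follow-up was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored and deceased both leave the risk set
    return curve


def curves_by_subtype(times, events, subtypes) -> Dict[str, List[Tuple[float, float]]]:
    """Compute one Kaplan-Meier curve per subtype label."""
    groups = defaultdict(lambda: ([], []))
    for t, e, s in zip(times, events, subtypes):
        groups[s][0].append(t)
        groups[s][1].append(e)
    return {s: kaplan_meier(ts, es) for s, (ts, es) in groups.items()}


# Toy cohort: six patients in two hypothetical subtypes.
curves = curves_by_subtype(
    times=[1, 2, 3, 4, 5, 6],
    events=[1, 0, 1, 1, 0, 1],
    subtypes=["A", "A", "A", "A", "B", "B"],
)
print(curves["A"])  # [(1, 0.75), (3, 0.375), (4, 0.0)]
print(curves["B"])  # [(6, 0.0)]
```

Plotting these step functions side by side is what lets a viewer judge whether subtypes separate in survival; a formal comparison would additionally need a test such as the log-rank test, which is not shown here.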

The power of this approach is illustrated in Figure 1, which shows a heatmap of the somatic mutation profile of the significantly mutated genes in the TCGA acute myeloid leukemia (AML) cohort, as well as the corresponding AML subtype designations for these samples.

**Figure 10: Using the UCSC Cancer Genomics Browser to explore relationships between somatic mutation profiles, genomic subtypes and survival.**

Aggregation and standardization of data and results

Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas

Larsson Omberg et al. Nature Genetics 10.1038/ng.2761

To coordinate all of the investigators working on the same data, standardized collections of data sets were released in the form of 'data freezes', which served as the input for all downstream analysis. Files in the data freeze were intuitively presented to researchers as lists of tab-delimited matrices for each tumor type and experimental platform. As described below, each file in a data freeze was associated with provenance tracking, data versioning, queryable structured meta-data and bindings to multiple analytical clients.

Each processed data set was associated with a provenance record, depicted as a graph of the input data sets and data-processing procedures used to generate the data (Fig. 3).

**Figure 11: Example provenance graph of a multistep workflow showing interaction between the analysis of three researchers.**
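A provenance record of this kind can be modeled as a directed acyclic graph in which each data set points back to its inputs and names the procedure that produced it. The sketch below is one minimal way to express that idea together with a content checksum for integrity checks. The `Record` class, its fields and the example file names are hypothetical and do not reflect the Synapse data model or API.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """One data set plus how it was made (hypothetical structure)."""
    name: str
    content: bytes
    procedure: str = "raw upload"
    inputs: List["Record"] = field(default_factory=list)

    @property
    def checksum(self) -> str:
        # A SHA-256 digest lets anyone verify the bytes have not changed.
        return hashlib.sha256(self.content).hexdigest()


def lineage(record: Record) -> List[str]:
    """List every upstream data set, nearest ancestors first, without repeats."""
    seen, order = set(), []
    queue = deque(record.inputs)
    while queue:
        parent = queue.popleft()
        if parent.name in seen:
            continue
        seen.add(parent.name)
        order.append(parent.name)
        queue.extend(parent.inputs)
    return order


# Toy workflow: two raw matrices, one normalization step, one clustering step.
raw_expr = Record("expression_raw.tsv", b"gene\ts1\ts2\n")
raw_cn = Record("copy_number_raw.tsv", b"gene\ts1\ts2\n")
normalized = Record("expression_norm.tsv", b"...", "normalize v1.0", [raw_expr])
clusters = Record("clusters.tsv", b"...", "cluster v0.3", [normalized, raw_cn])

history = lineage(clusters)
print(history)
# ['expression_norm.tsv', 'copy_number_raw.tsv', 'expression_raw.tsv']
print(clusters.checksum[:12])
```

Storing the procedure name and version on each node is what makes a derived result reproducible: the graph says which inputs and which code to rerun.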

In addition to creating capabilities for describing and sharing analysis workflows, the Pan-Cancer group also explored a research model in which independent groups of investigators collaboratively evolved novel analytical methods through the use of automated tools to assess the performance of each approach¹⁴. Specifically, the Pan-Cancer group used tools to provide real-time automated assessments, based on common performance metrics, of both 'unsupervised' clustering methods (J.S. et al., unpublished data) and 'supervised' molecular prognostic models of patient survival (Y.Y., E.M. Van Allen, L.O., N. Wagle, A. Sokolov et al., unpublished data).
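As an illustration of what an automated assessment of survival models might compute, the sketch below scores predicted risks with the concordance index and ranks submissions by that score. The choice of the concordance index is an assumption made here for illustration; the text says only that common performance metrics were used. The function names and the two example submissions are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Fraction of comparable patient pairs ranked correctly by predicted risk.

    A pair (i, j) is comparable when patient i has an observed event before
    patient j's follow-up time; it is concordant when i has the higher risk.
    Tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def leaderboard(
    times, events, submissions: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Score every submission on the same held-out data, best first."""
    scores = {
        name: concordance_index(times, events, risks)
        for name, risks in submissions.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy evaluation set and two made-up submissions.
times, events = [1, 2, 3], [1, 1, 0]
board = leaderboard(
    times,
    events,
    {"model_a": [3.0, 2.0, 1.0], "model_b": [1.0, 2.0, 3.0]},
)
print(board)  # [('model_a', 1.0), ('model_b', 0.0)]
```

Scoring every method against the same data with the same metric is what makes results from independent groups directly comparable.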

The data freezes, analysis results and evaluation framework for survival predictions each correspond to a new publicly available resource released in conjunction with this work. First, the curated Pan-Cancer data freezes are now available (syn300013), allowing researchers to easily access well-curated, analysis-ready data sets from the TCGA Research Network. Data freezes will continue to be maintained and updated in future expansions of the Pan-Cancer project. Updates will be immediately available to the community, allowing any researcher to use data from and contribute to the Pan-Cancer project. Second, we are releasing a resource of Pan-Cancer analysis results (syn1895888), containing the results of applying the most commonly used algorithms developed throughout the course of the broader TCGA effort.

Visualization and interpretation of cancer genome analysis

IntOGen-mutations identifies cancer drivers across tumor types

Abel Gonzalez-Perez et al. Nature Methods 10.1038/nmeth.2642

The IntOGen-mutations platform (http://www.intogen.org/mutations/) summarizes somatic mutations, genes and pathways involved in tumorigenesis. It identifies and visualizes cancer drivers, analyzing 4,623 exomes from 13 cancer sites. The different modules in the pipeline are executed by a workflow management system (Wok, https://bitbucket.org/bbglab/wok/).

The results of the pipeline are automatically loaded into a Web browser managed by the Onexus framework (Supplementary Fig. 1). IntOGen-mutations will be regularly updated with new cancer genome resequencing data. The results can be browsed through the Web (Supplementary Note 2) and with Gitools interactive heat maps¹⁵ (http://www.gitools.org/datasets/). The pipeline may be downloaded and can also be run online on our servers. It can be used to identify drivers from newly sequenced cohorts of tumor samples (Supplementary Note 3) and to interpret the mutations observed in a tumor sample (Supplementary Note 4).

**Figure 12: Supplementary Figure 1: Diagram representing three different use cases of the IntOGen-mutations resource.**

Author information

Authors and Affiliations

University Pompeu Fabra,
Nuria Lopez-Bigas
University of California, Santa Cruz,
Melissa Cline
University of Texas MD Anderson Cancer Center,
Bradley Broom
Sage Bionetworks,
Adam Margolin & Larsson Omberg
University of Texas MD Anderson GDAC,
John Weinstein
Nature Genetics,
Myles Axton

Corresponding authors

Correspondence to Nuria Lopez-Bigas, Melissa Cline, Bradley Broom, Adam Margolin, Larsson Omberg, John Weinstein or Myles Axton.

Supplementary information

Supplementary Text and Figures for 10.1038/nmeth.2642 (PDF 4548 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.

About this article

Cite this article

Lopez-Bigas, N., Cline, M., Broom, B. et al. Thread 4: Data discovery, transparency and visualization. Nat Genet ng.2789 (2013). https://doi.org/10.1038/ng.2789
