Abstract
Collaborative analysis of multiple large orthogonal data sets presents many challenges. Here are some of the solutions created and used in the Pan-Cancer analysis initiative.
Main
We want to be able to verify data integrity while tracking the provenance and versions of all data and results, as well as the software that has been run on them to generate derivative data sets. Different groups have different platforms, so compatibility, standards and documentation are essential to ensure integrity and to maximize data reuse. In addition, the complexity of the data makes it challenging to visualize and explore them effectively to extract relevant knowledge.
Heat maps
Heat maps are a mainstay for visualizing clustered or ordered data matrices and identifying groups of related data points. Many tools produce heat maps, but most produce static images with a fixed sort order and no option to select which data are displayed, providing no capability to explore the data interactively. Tools that overcome these limitations include Gitools interactive heat maps (PLoS One 6, e19541, 2011), Next-Generation Clustered Heat Maps (NG-CHM) and the UCSC Cancer Genomics Browser.
Gitools heat maps, whose columns and rows normally represent tumor samples and genes, respectively, can represent multiple values in each cell, which makes them especially well suited for the representation of multidimensional cancer genomics data. Their interactive capabilities allow the user to filter, sort, move and hide rows and columns in the context of gene and tumor sample annotations and to launch several common exploratory analyses, such as correlations, clustering, enrichment and statistical comparisons between groups of samples. A built-in option allows users to sort the genes and samples within a heat map following the pattern of mutual exclusivity of alterations. TCGA Pan-Cancer data are ready to be browsed with Gitools interactive heat maps at http://www.gitools.org/datasets. The Gitools Pan-Cancer multidimensional matrix contains information for 4,678 samples and 22,046 genes, with 4 values per sample and gene, including expression, copy number and mutation data. In addition, numerous clinical and molecular annotations of samples and genes can be added to the visualization and used to filter, sort and analyze the data. The candidate cancer driver genes predicted using multiple methods, including MutSig, MuSiC, OncodriveFM, OncodriveCLUST and ActiveDriver, are also available (Sci. Rep. doi:10.1038/srep02650). In addition, all the results reported by IntOGen-mutations are available to be browsed (Nat. Methods doi:10.1038/nmeth.2642).
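The mutual-exclusivity sort mentioned above can be illustrated with a minimal sketch. This is not the Gitools implementation, which the text does not describe; it is one common heuristic, assumed here for illustration: order genes by alteration frequency, then order samples by their alteration pattern read down the sorted genes. The function name and the toy gene and sample labels are hypothetical.

```python
def sort_by_mutual_exclusivity(matrix, genes, samples):
    """Reorder a binary alteration matrix into a staircase layout.

    matrix[g][s] is 1 if gene g is altered in sample s, else 0.
    Illustrative heuristic only; not the algorithm used by Gitools.
    """
    # Genes with the most altered samples go to the top.
    gene_order = sorted(range(len(genes)), key=lambda g: -sum(matrix[g]))

    # Samples are compared by their pattern down the sorted genes, so
    # samples altered in the top gene come first, then samples altered
    # in the second gene but not the first, and so on.
    def sample_key(s):
        return tuple(-matrix[g][s] for g in gene_order)

    sample_order = sorted(range(len(samples)), key=sample_key)

    sorted_genes = [genes[g] for g in gene_order]
    sorted_samples = [samples[s] for s in sample_order]
    sorted_matrix = [[matrix[g][s] for s in sample_order] for g in gene_order]
    return sorted_genes, sorted_samples, sorted_matrix


# Toy data (made up for illustration).
toy_genes = ["geneA", "geneB", "geneC"]
toy_samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
toy_matrix = [
    [0, 1, 0, 1, 0, 0],  # geneA
    [1, 0, 0, 0, 1, 1],  # geneB
    [0, 0, 1, 0, 0, 0],  # geneC
]
genes_out, samples_out, matrix_out = sort_by_mutual_exclusivity(
    toy_matrix, toy_genes, toy_samples
)
for gene, row in zip(genes_out, matrix_out):
    print(gene, row)
```

When alterations are close to mutually exclusive, the sorted matrix shows a diagonal staircase of altered cells, which is the visual cue this kind of ordering is meant to expose.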
Digging into TCGA data with interactive graphics
The 1990s saw the introduction of clustered heat maps for omic biology (Science 275, 343–349, 1997). They have since become the most popular way to visualize patterns in molecular profiling data, for example, from microarrays and sequencing technologies. They have been included in all of the TCGA papers published in Nature so far on the cancers of specific tissue origin. Of necessity, however, they have been included in the pages of the journal as static images. We have now developed 'next-generation' clustered heat maps, which use a Google Maps–like tiling technology for extreme zooming and navigation without loss of resolution. Next-generation clustered heat maps provide pathway and Gene Ontology (GO) information, chromosomal interactive ideograms, recoloring on the fly, high-resolution graphics output and linkouts to public information resources (for example, to the cBio Portal) on genes, proteins, pathways and drugs. Perhaps most importantly, all of the metadata elements needed to reproduce them months or years later are captured and automatically saved. The result is a visually rich, dynamic environment for the exploration of the masses of data produced by TCGA. The Compendium of TCGA Pan-Cancer NG-CHMs (Figs. 1, 2a–c and 3–6) as of August 2013 displayed 676 maps as an initial set, but the numbers will soon rise into the thousands as more data types, tumor types and algorithms are incorporated.
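To make the idea of map-style tiling concrete, the sketch below computes which tiles a viewer would need to fetch for a given viewport and zoom level. The text does not describe the tiling scheme that NG-CHM actually uses, so this is a generic tile pyramid under stated assumptions: the function name, the default tile size of 256 cells and the convention that each lower zoom level doubles the span of a tile are all hypothetical.

```python
def tiles_for_viewport(row_range, col_range, zoom, max_zoom, tile_cells=256):
    """Return (zoom, tile_row, tile_col) keys for tiles covering a viewport.

    row_range and col_range are half-open (start, stop) intervals measured
    in matrix cells. At max_zoom one tile covers tile_cells x tile_cells
    cells; each lower zoom level doubles the span of a tile per side.
    Generic sketch; not the NG-CHM tiling scheme.
    """
    if not 0 <= zoom <= max_zoom:
        raise ValueError("zoom must lie between 0 and max_zoom")
    span = tile_cells * 2 ** (max_zoom - zoom)  # cells per tile side
    r0, r1 = row_range
    c0, c1 = col_range
    if r1 <= r0 or c1 <= c0:
        return []  # empty viewport
    tile_rows = range(r0 // span, (r1 - 1) // span + 1)
    tile_cols = range(c0 // span, (c1 - 1) // span + 1)
    return [(zoom, r, c) for r in tile_rows for c in tile_cols]


# A viewport of rows 0-511 and columns 256-299 at full resolution.
print(tiles_for_viewport((0, 512), (256, 300), zoom=2, max_zoom=2))
# The same viewport one zoom level out fits in a single coarser tile.
print(tiles_for_viewport((0, 512), (256, 300), zoom=1, max_zoom=2))
```

Because only the tiles intersecting the viewport are requested, the cost of drawing stays roughly constant regardless of the size of the full matrix.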
An example: visualizing the pan-cancer phosphoproteome
TCPA: a resource for cancer functional proteomics data
Jun Li et al. Nature Methods 10.1038/nmeth.2650
The Visualization module provides two ways to examine global protein expression patterns in a specific RPPA data set. One is through a “next-generation clustered heat map” (Fig. 1, v), which allows users to zoom, navigate and scrutinize clustering patterns of samples or proteins and link those patterns to relevant biological information sources. The other is through a network view (Fig. 1, vi), which overlays the correlation between any two interacting partners in the protein interaction network (curated in the Human Protein Reference Database; ref. 8).
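The network view described above pairs an interaction network with a per-edge correlation. A minimal sketch of that computation follows, assuming Pearson correlation (the text does not name the correlation measure) and using made-up protein labels and expression values. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant data
    return cov / sqrt(var_x * var_y)


def edge_correlations(expression, interactions):
    """Correlate expression for each interacting pair that has data.

    expression maps a protein label to its values across samples;
    interactions is an iterable of (protein_a, protein_b) pairs.
    """
    return {
        (a, b): pearson(expression[a], expression[b])
        for a, b in interactions
        if a in expression and b in expression
    }


# Toy data (made up for illustration).
toy_expression = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.0],
    "P3": [4.0, 3.0, 2.0, 1.0],
}
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P9")]
for edge, r in edge_correlations(toy_expression, toy_interactions).items():
    print(edge, round(r, 3))
```

Pairs with a missing partner (here the hypothetical "P9") are skipped rather than reported, so the overlay only colors edges for which both proteins were measured.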
Ordinating cancer information
Cancer can yield many types of alterations, and cancer genomics demands many forms of visualization to best assess them collectively.
Exploring TCGA pan-cancer data at the UCSC Cancer Genomics Browser
Melissa Cline et al. Scientific Reports 10.1038/srep02652
There are many choices for cancer genomics visualization tools, as reviewed (ref. 3). They fall into three general categories: genome-based, gene-based, and pathway-based. Genome-based methods such as the UCSC Cancer Genomics Browser and IGV (http://www.broadinstitute.org/igv/) are well suited for exploring alterations that follow genomic coordinates, such as copy number variations or DNA methylation profiles. However, because genome-based visualizations only allow the user to see one or two genomic regions at once, they are generally not as effective for exploring possible connections between alterations in multiple different genomic regions. Gene-based visualizations are offered by Gitools (http://www.gitools.org/) [as used in the visualization of data by Gonzalez-Perez et al., doi:10.1038/nmeth.2642 and Tamborero et al., doi:10.1038/srep02650], cBio Portal (http://www.cbioportal.org/public-portal/) [as used by Ciriello et al., doi:10.1038/ng.2762], IntOGen (http://www.intogen.org) [as described by Gonzalez-Perez et al., doi:10.1038/nmeth.2642 and used by Tamborero et al., doi:10.1038/srep02650], IGV, and the UCSC Cancer Genomics Browser. These tools are useful for exploring alterations in multiple genes at once, especially when the set of genes has known functional significance such as known marker genes or genes in a common pathway. The UCSC Cancer Genomics Browser further allows probe-based visualization in its gene-based viewing mode, which is particularly effective for assessing exon expression or DNA methylation data in genes of interest. However, these heat map-based methods do not generally indicate how genes interact in a pathway. Pathway-based methods such as Cytoscape (http://cytoscape.org/) and cBio Portal allow the user to see known functional connections between genes, but are limited to known pathways (which are limited in coverage) or predicted pathways (which tend to have high error rates).
In short, no tool meets all purposes, and all tools have scenarios in which they are most effective.
The UCSC Cancer Genomics Browser presents the TCGA data and Pan-Cancer subtypes in a coherent, integrated system for both TCGA researchers and the scientific community at large. It provides direct access to and visualization of data at specific genes or genomic regions on samples of interest. It displays genomic aberrations alongside clinical and annotation features in a flexible, dynamic display, sorted by the features of the user's selection. Integrated Kaplan-Meier plots allow researchers to assess when changes in cancer subtype coincide with changes in survival.
The power of this approach is illustrated in Figure 1, which shows a heat map of the somatic mutation profile of the significantly mutated genes in the TCGA acute myeloid leukemia (AML) cohort, as well as the corresponding AML subtype designations for these samples.
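For readers unfamiliar with the Kaplan-Meier plots mentioned above, the following sketch computes the standard product-limit survival estimate and splits it by subtype. It does not reflect how the UCSC Cancer Genomics Browser implements its plots; the function names, subtype labels and follow-up values are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up durations; events: 1 if the event was observed,
    0 if the patient was censored. Returns [(time, survival)] with one
    entry per distinct time at which at least one event occurred.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored cases both leave the risk set
    return curve


def kaplan_meier_by_group(times, events, groups):
    """One survival curve per group label (for example, per subtype)."""
    curves = {}
    for label in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == label]
        curves[label] = kaplan_meier(
            [times[i] for i in idx], [events[i] for i in idx]
        )
    return curves


# Toy follow-up data (made up for illustration).
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Plotting the per-group curves as step functions on shared axes gives the usual visual comparison of survival between subtypes.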
Aggregation and standardization of data and results
Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas
Larsson Omberg et al. Nature Genetics 10.1038/ng.2761
To coordinate all of the investigators working on the same data, standardized collections of data sets were released in the form of 'data freezes', which served as the input for all downstream analysis. Files in the data freeze were presented to researchers as lists of tab-delimited matrices for each tumor type and experimental platform. As described below, each file in a data freeze was associated with provenance tracking, data versioning, queryable structured metadata and bindings to multiple analytical clients.
Each processed data set was associated with a provenance record, depicted as a graph of the input data sets and data-processing procedures used to generate the data (Fig. 3).
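A provenance record of this kind can be modeled as a directed graph in which each data set points to the inputs it was derived from. The sketch below is an assumption-laden illustration, not the platform's actual data model: the class, field and function names are hypothetical, and the SHA-256 checksum is one possible way to support the integrity checks and versioning described above.

```python
import hashlib
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record: one versioned data set and how it was made."""
    name: str
    version: int
    sha256: str
    generated_by: str = "raw upload"  # label of the processing step
    inputs: tuple = ()                # names of upstream data sets


def make_record(name, version, content, generated_by="raw upload", inputs=()):
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(name, version, digest, generated_by, tuple(inputs))


def is_intact(record, content):
    """True if the bytes still match the checksum stored in the record."""
    return hashlib.sha256(content).hexdigest() == record.sha256


def upstream(records, name):
    """Breadth-first list of every data set that `name` was derived from."""
    seen, order = set(), []
    queue = deque(records[name].inputs)
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(records[current].inputs)
    return order


# Toy provenance graph (names and contents made up for illustration).
toy_records = {
    r.name: r
    for r in [
        make_record("raw_expression", 1, b"gene\tS1\tS2\n"),
        make_record("raw_mutations", 1, b"gene\tS1\tS2\n"),
        make_record("normalized_expression", 2, b"normalized",
                    generated_by="normalize", inputs=["raw_expression"]),
        make_record("clusters", 1, b"cluster labels", generated_by="cluster",
                    inputs=["normalized_expression", "raw_mutations"]),
    ]
}
print(upstream(toy_records, "clusters"))
```

Walking the graph from a result back to its raw inputs is what lets a collaborator check that an analysis was run on the intended version of a data freeze.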
In addition to creating capabilities for describing and sharing analysis workflows, the Pan-Cancer group also explored a research model in which independent groups of investigators collaboratively evolved novel analytical methods through the use of automated tools to assess the performance of each approach (ref. 14). Specifically, the Pan-Cancer group used tools to provide real-time automated assessments, based on common performance metrics, of both 'unsupervised' clustering methods (J.S. et al., unpublished data) and 'supervised' molecular prognostic models of patient survival (Y.Y., E.M. Van Allen, L.O., N. Wagle, A. Sokolov et al., unpublished data).
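The text does not say which performance metrics were used for the automated assessments. As one plausible example for survival models, the sketch below scores each submitted model with Harrell's concordance index and ranks the submissions; the choice of metric, the function names and the toy model names are all assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    before the end of patient j's follow-up. Ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def leaderboard(times, events, submissions):
    """Rank submitted models (name -> risk scores) from best to worst."""
    scored = [
        (name, concordance_index(times, events, scores))
        for name, scores in submissions.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy evaluation set and submissions (made up for illustration).
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_submissions = {
    "model_a": [0.9, 0.7, 0.4, 0.1],  # higher risk for earlier events
    "model_b": [0.1, 0.4, 0.7, 0.9],  # ranks patients in reverse
    "model_c": [0.5, 0.5, 0.5, 0.5],  # uninformative
}
for name, score in leaderboard(toy_times, toy_events, toy_submissions):
    print(name, round(score, 3))
```

Scoring every submission against the same held-out data with the same function is what makes results from independent groups directly comparable.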
The data freezes, analysis results and evaluation framework for survival predictions each correspond to a new publicly available resource released in conjunction with this work. First, the curated Pan-Cancer data freezes are now available (syn300013), allowing researchers to easily access well-curated, analysis-ready data sets from the TCGA Research Network. Data freezes will continue to be maintained and updated in future expansions of the Pan-Cancer project. Updates will be immediately available to the community, allowing any researcher to use data from and contribute to the Pan-Cancer project. Second, we are releasing a resource of Pan-Cancer analysis results (syn1895888), containing the results of applying the most commonly used algorithms developed throughout the course of the broader TCGA effort.
Visualization and interpretation of cancer genome analysis
IntOGen-mutations identifies cancer drivers across tumor types
Abel Gonzalez-Perez et al. Nature Methods 10.1038/nmeth.2642
The IntOGen-mutations platform (http://www.intogen.org/mutations/) summarizes somatic mutations, genes and pathways involved in tumorigenesis. It identifies and visualizes cancer drivers, analyzing 4,623 exomes from 13 cancer sites. The different modules in the pipeline are executed by a workflow management system (Wok, https://bitbucket.org/bbglab/wok/).
The results of the pipeline are automatically loaded into a Web browser managed by the Onexus framework (Supplementary Fig. 1). IntOGen-mutations will be regularly updated with new cancer genome resequencing data. The results can be browsed through the Web (Supplementary Note 2) and with Gitools interactive heat maps (ref. 15; http://www.gitools.org/datasets/). The pipeline may be downloaded and can also be run online on our servers. It can be used to identify drivers from newly sequenced cohorts of tumor samples (Supplementary Note 3) and to interpret the mutations observed in a tumor sample (Supplementary Note 4).
Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.
About this article
Cite this article
Lopez-Bigas, N., Cline, M., Broom, B. et al. Thread 4: Data discovery, transparency and visualization. Nat Genet ng.2789 (2013). https://doi.org/10.1038/ng.2789
DOI: https://doi.org/10.1038/ng.2789