Main

The genomes of all cancers accumulate somatic mutations1. These include nucleotide substitutions, small insertions and deletions, chromosomal rearrangements and copy number changes that can affect protein-coding or regulatory components of genes. In addition, cancer genomes usually acquire somatic epigenetic ‘marks’ compared to non-neoplastic tissues from the same organ, notably changes in the methylation status of cytosines at CpG dinucleotides.

A subset of the somatic mutations in cancer cells confers oncogenic properties such as growth advantage, tissue invasion and metastasis, angiogenesis, and evasion of apoptosis2. These are termed ‘driver’ mutations. The identification of driver mutations will provide insights into cancer biology and highlight new drug targets and diagnostic tests. Knowledge of cancer mutations has already led to the development of specific therapies, such as trastuzumab for HER2 (also known as NEU or ERBB2)-positive breast cancers3 and imatinib, which targets BCR-ABL tyrosine kinase for the treatment of chronic myeloid leukaemia4,5. The remaining somatic mutations in cancer genomes that do not contribute to cancer development are called ‘passengers’. These mutations provide insights into the DNA damage and repair processes that have been operative during cancer development, including exogenous environmental exposures6,7. In most cancer genomes, it is anticipated that passenger mutations, as well as germline variants not yet catalogued in polymorphism databases, will substantially outnumber drivers.

Large-scale analyses of genes in tumours have shown that the mutation load in cancer is abundant and heterogeneous8,9,10,11,12,13. Preliminary surveys of cancer genomes have already demonstrated their relevance in identifying new cancer genes that constitute potential therapeutic targets for several types of cancer, including PIK3CA14, BRAF15, NF1 (ref. 10), KDR10, PIK3R1 (ref. 9), and histone methyltransferases and demethylases16,17. These projects have also yielded correlations between cancer mutations and prognosis, such as IDH1 and IDH2 mutations in several types of gliomas13,18. Advances in massively parallel sequencing technology have enabled sequencing of entire cancer genomes19,20,21,22.

Following the launch of comprehensive cancer genome projects in the United Kingdom (Cancer Genome Project)23 and the United States (The Cancer Genome Atlas)24, cancer genome scientists and funding agencies met in Toronto (Canada) in October 2007 to discuss the opportunity to launch an international consortium. Key reasons for its formation were: (1) the scope is huge; (2) independent cancer genome initiatives could lead to duplication of effort or incomplete studies; (3) lack of standardization across studies could diminish the opportunities to merge and compare data sets; (4) the spectrum of many cancers is known to vary across the world; and (5) an international consortium will accelerate the dissemination of data sets and analytical methods into the user community.

Working groups were created to develop strategies and policies that would form the basis for participation in the ICGC. The goals of the consortium (Box 1) were released in April 2008 (http://www.icgc.org/files/ICGC_April_29_2008.pdf). Since then, working groups and initial member projects have further refined the policies and plans for international collaboration.

Bioethical framework

ICGC members agreed to a core set of bioethical elements for consent as a precondition of membership (Box 2). The Ethics and Policy Committee has created patient consent templates for both prospective collection and retrospective use of samples and data for ICGC projects. Differences in project-specific requirements and national legal frameworks may require some local amendments, while still reflecting the core principles of ICGC.

The ICGC recognizes a delicate balance between protecting participants’ personal data and sharing these data to accelerate cancer research. Data access policies have been drawn up that are respectful of the rights of the donors, while allowing ICGC data derived from samples to be shared ethically among a wide research community. Two levels of access have been implemented. ‘Open access’ data sets, which contain data that cannot be used to identify individuals, are publicly available. These include data such as gender, age range, histology, normalized gene expression values, epigenetic data sets, somatic mutations, summaries of germline data, and study protocols. ‘Controlled access’ data sets contain germline genomic data and detailed clinical information that are associated with a unique individual whose personal identifiers have been removed. To access controlled data sets, researchers must seek authorization by contacting the Data Access Compliance Office (DACO) (http://www.icgc.org/daco). An independent International Data Access Committee (IDAC) oversees the work of the DACO and provides assistance with resolving issues that arise.
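The two access tiers can be pictured as a filter over a donor record. The following is a minimal sketch only: the field names, the `OPEN_FIELDS` allow-list and the `has_daco_authorization` flag are hypothetical and do not describe the actual ICGC schema or the DACO process.

```python
# Hypothetical allow-list of fields treated as open access in this sketch.
OPEN_FIELDS = {"gender", "age_range", "histology", "somatic_mutations"}


def open_view(record: dict) -> dict:
    """Return only the fields on the open-access allow-list."""
    return {key: value for key, value in record.items() if key in OPEN_FIELDS}


def view_for(record: dict, has_daco_authorization: bool) -> dict:
    """Return the full record to authorized users, the open view otherwise."""
    return dict(record) if has_daco_authorization else open_view(record)


# Invented example record; values are placeholders, not real data.
example_record = {
    "gender": "female",
    "age_range": "50-59",
    "histology": "ductal carcinoma",
    "germline_genotypes": "controlled payload",
    "detailed_clinical_history": "controlled payload",
}

print(sorted(view_for(example_record, has_daco_authorization=False)))
```

An allow-list is used here so that any field not explicitly marked as open is withheld by default.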

Pathology and clinical annotation

Large-scale genomic studies of human tumours rely on the availability of freshly frozen tumour tissue. To address the paucity of samples that meet ICGC standards, many projects have initiated prospective collections of high-quality source material. Accordingly, the ICGC recommended procedures to promote consistency of sample processing throughout the consortium and ensure a series of quality features such as high tissue integrity and tumour cell content. Each project will need to include diverse data types, such as environmental exposures, clinical history of participants, tumour histopathology, and clinical outcomes.

Tumours show considerable clinical and biological heterogeneity that has resulted in a variety of tumour classifications. Within the ICGC, special measures are taken to promote the consistency of diagnosis. These include the coordination of diagnostic criteria among groups investigating tumours that are related, and a policy that all samples will be reviewed by at least two independent reference pathologists. Furthermore, images of the stained tumour sections (or blood smears or cytospins for haematological neoplasias) from which diagnoses were made will be stored and made available to the community.

Although different tumour types may require specific procedures for tumour acquisition or compilation of clinical and environmental data, the ICGC has set guidelines about the use of common definitions and data standards. This will allow ICGC data users to identify correlations between tumour-specific molecular changes and clinical and histopathological data, including prognosis, prediction of therapy response and tumour classification schemes for diagnosis.

Study design and statistical issues

To identify cancer-related genes, one needs to detect genes that are mutated at a higher frequency than the background mutation rate. Given that several driver genes have been found to be mutated at low frequencies, the ICGC will identify somatic mutations observed in at least 3% of tumours of a given subtype. The ICGC determined that 500 samples would be needed per tumour type (although for rare tumour types, a smaller sample size may be justified). In practice, the degree of heterogeneity of a given tumour type is difficult to know in advance, such that some particularly heterogeneous tumour types may require larger sample collections.
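As a rough illustration of why a collection of this size is useful, the sketch below computes the probability of seeing a gene mutated in at least k of n tumours when its true frequency is 3%. It assumes a simple binomial model with independent tumours; the function name and the thresholds k are illustrative choices, and this is not the consortium's own calculation, which is not described here.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


n_tumours = 500  # sample size per tumour type given in the text
freq = 0.03      # minimum mutation frequency targeted in the text
expected = n_tumours * freq  # mean number of mutated tumours, about 15

for k in (5, 10):  # illustrative detection thresholds (assumed)
    print(k, prob_at_least(k, n_tumours, freq))
```

Under these assumptions a 3% gene is expected in about 15 of 500 tumours. Whether that count stands out depends on the background mutation rate, which this sketch does not model.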

Cancer genome analyses

High-quality catalogues of somatic mutations from whole cancer genomes will ultimately be the ICGC standard. Shotgun sequencing using second generation technologies can detect all classes of somatic mutation implicated in cancer. Moreover, if the level of coverage is sufficient, comprehensive high-quality catalogues of somatic mutations from individual cancer genomes can be acquired with >90% sensitivity and >95% specificity. To achieve this, it will be necessary to sequence the genome of both the cancer and a normal tissue from the same individual to distinguish germline variants. Although a few genomes of this standard have already been generated, the cost and the continuing technology development will mean that interim analyses of particularly informative sectors of the genome will be carried out, for example of all coding exons and microRNAs.
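The role of the matched normal genome can be shown with a toy example: variants present in the tumour but absent from the normal tissue are somatic candidates, and a catalogue can then be scored against a validated truth set. This is a sketch under strong assumptions. Exact set subtraction stands in for the statistical comparison that real variant callers perform, and all coordinates and counts below are invented.

```python
# A variant is represented as (chromosome, position, ref_allele, alt_allele).
Variant = tuple[str, int, str, str]


def somatic_candidates(tumour: set[Variant], normal: set[Variant]) -> set[Variant]:
    """Variants seen in the tumour but not in the matched normal tissue."""
    return tumour - normal


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real somatic mutations that were called."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-mutated sites correctly left uncalled."""
    return true_neg / (true_neg + false_pos)


# Invented toy data: one germline variant shared by both tissues, one somatic.
tumour_calls = {("chr1", 100, "A", "T"), ("chr2", 200, "C", "G")}
normal_calls = {("chr1", 100, "A", "T")}

print(somatic_candidates(tumour_calls, normal_calls))
```

Without the normal genome, the shared germline variant at the invented position `chr1:100` would be indistinguishable from a somatic change.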

For each individual cancer genome, the catalogue of somatic mutations will be supplemented by genome-wide information on the state of methylation of CpG dinucleotides. The optimal strategies and technologies to achieve this are not yet clear. Moreover, the genomes of individual cancers will be accompanied, where possible, by analyses of the transcriptome. Although conventional array-based approaches predominate at present, it is preferable that RNA sequencing becomes the standard as sequencing has a greater dynamic range25 and provides further information including new transcripts and sequence variants26.

ICGC data sets

The distributed nature of the consortium coupled with the large size of the data sets makes it cumbersome to store all data in a single centralized repository. For this reason, the ICGC has adopted a ‘franchise’ database model for integrating the information and making it available to the public. Under this model, each member project releases tumour information by copying it into its local franchise database after it has been quality checked. Each franchise database shares a common schema to describe the specimens, the associated clinical information, and their genome characterization data. ICGC primary data files are sent to the National Center for Biotechnology Information (NCBI) and/or the European Bioinformatics Institute (EBI) for archiving, while interpreted data sets, such as somatic mutation calls, are stored in franchise databases. The ICGC franchise databases and web portal use BioMart27, a data federation technology originally developed for use in Ensembl28, and since adopted for use by several model organism and genome databases. The management of the ICGC data flow is the responsibility of the ICGC Data Coordination Center (DCC) located at the Ontario Institute for Cancer Research.

The DCC also operates the ICGC data portal that allows researchers to access both open and controlled access portions of the ICGC data. The portal provides a variety of user interfaces that range from simple gene-oriented queries (‘show me all the non-silent coding mutations identified in PIK3R1 for all cancers’) to queries that integrate genomic, clinical and functional information (‘show me all members of the Toll-receptor pathway having deletions in stage III breast cancer’). These queries will be distributed across the franchise databases in a manner that is invisible to the user. The portal will also provide links to the primary files at the NCBI and EBI, interfaces for generating tabular reports, data dumps in common bioinformatics formats, and other visualizations including genome browser tracks, pathway diagrams and survival curves. The portal is available via a link at http://www.icgc.org.
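The idea of distributing a query across franchise databases that share a common schema can be sketched in a few lines. The classes, field names and records below are hypothetical and held in memory. They do not use or represent the BioMart API or the real ICGC schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationRecord:
    """Common schema shared by every franchise in this sketch (assumed)."""
    project: str
    gene: str
    consequence: str


class Franchise:
    """A local database holding one member project's interpreted data."""

    def __init__(self, name: str, records: list[MutationRecord]) -> None:
        self.name = name
        self._records = list(records)

    def query(self, gene: str) -> list[MutationRecord]:
        """Return non-silent mutations recorded for the given gene."""
        return [
            r for r in self._records
            if r.gene == gene and r.consequence != "synonymous"
        ]


def federated_query(franchises: list[Franchise], gene: str) -> list[MutationRecord]:
    """Send the same query to every franchise and merge the results."""
    results: list[MutationRecord] = []
    for franchise in franchises:
        results.extend(franchise.query(gene))
    return results


# Invented records for two hypothetical franchises.
breast = Franchise("breast-uk", [
    MutationRecord("breast-uk", "PIK3R1", "missense"),
    MutationRecord("breast-uk", "PIK3R1", "synonymous"),
])
liver = Franchise("liver-jp", [
    MutationRecord("liver-jp", "PIK3R1", "frameshift"),
    MutationRecord("liver-jp", "TP53", "missense"),
])

hits = federated_query([breast, liver], "PIK3R1")
print([(h.project, h.consequence) for h in hits])
```

Because every franchise answers the same query against the same schema, the caller sees a single merged result and does not need to know where each record is stored.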

At the time of this publication, the following cancer and reference data sets will be available through the ICGC web portal: (1) initial data releases from ICGC members for breast cancer (UK), liver cancer (Japan), and pancreatic cancer (Australia and Canada); (2) a whole genome data set of a metastatic melanoma cell line (COLO829)6; (3) open data sets from The Cancer Genome Atlas (TCGA) for glioblastoma multiforme (GBM) and serous cystadenocarcinoma of the ovary (see later); (4) whole exome somatic mutation data from 68 individuals with breast, colorectal, pancreatic cancer and GBM11,12,13; (5) links to the human reference genome (http://www.genomereference.org/) and gene annotations from the GENCODE project (http://www.sanger.ac.uk/gencode/) that include the CCDS gene set29; (6) links to the single nucleotide polymorphism database (dbSNP)30 and the HapMap31 databases, providing access to common patterns of variation in reference population samples; (7) links to Reactome32, a curated database of biological pathways in humans; and (8) a set of reference gene models, mirrored from Ensembl28.

The current version of the web portal provides an entry point to the open access data tier by interactive query as well as bulk download of data files. We expect that in mid-2010 both open access and controlled data will be available.

The ICGC recently established a bioinformatics analysis working group to compare pipelines, analytic methods, consistency within and among algorithms, and establish guidelines or best practices for the consortium. Over time, significant resources will be deployed to develop strategies to analyse the large complex data sets generated by ICGC member projects, and provide value-added views of cancer genomic data by integrating them with other biological and epidemiological data sets.

Data release and intellectual property policies

The data release policies of the ICGC are intended to maximize public benefit while, at the same time, protecting the interests and rights of sample donors and their relatives. Members of the ICGC are committed to the principles of rapid data release (with appropriate controlled access mechanisms), in concordance with the Toronto statement33. ICGC members encourage the scientific community to use any data that target specific genes and mutations, without any restrictions. To allow ICGC members the opportunity to be the first to publish global analyses from data sets they generate, the consortium has also agreed that member projects may specify conditions that include a time limit during which other data users are asked to refrain from publishing global analyses (defined by several ICGC member projects as 100 tumours and matched controls), a provision referred to as a ‘publication moratorium’. To allow time for a data set to be analysed and submitted for publication, ICGC members will have at most one year after released data sets reach the specified threshold before third parties are permitted to submit manuscripts describing global analyses. Further details on data release guidelines for data producers, users and reviewers are available at http://www.icgc.org. Users of ICGC data are expected to respect these terms and to cite this manuscript and the source of pre-publication data, including the version of the data set. In cases of uncertainty, scientists using ICGC data are encouraged to contact the member projects to discuss publication plans.

ICGC members believe that maximum public benefit will be achieved if the data remain publicly accessible without patent restrictions, hence no claims to possible intellectual property derived from primary data (including somatic mutations) will be made. Users of ICGC data (including ICGC members) may elect to perform further research and to exercise their intellectual property rights on these downstream discoveries. If this occurs, users are expected to implement licensing policies that do not obstruct further research.

Initial ICGC projects

At present, ten countries and two European consortia have initiated cancer genome projects under the umbrella of the ICGC. The initial projects, listed in Supplementary Table 1, will analyse tumour types found around the globe and throughout the human body affecting a diversity of organs, including blood, brain, breast, kidney, liver, pancreas, stomach, oral cavity and ovary. Over time, the ICGC will investigate 50 or more types and subtypes of cancer in adults and children. In the case of tumours with several subtypes, analyses should be focused on subtypes that may be defined on pathological, molecular, aetiological or geographical differences. It is expected that some cancer types will be studied in parallel in different parts of the world, as the mutation profiles may differ among populations. The consortium has enabled the coordination of initial projects analysing similar cancers in different countries, and in some cases, the redirection of resources to launch new projects.

The Cancer Genome Atlas

TCGA is a comprehensive program in cancer genomics that is jointly supported and managed by the National Cancer Institute and the National Human Genome Research Institute of the US National Institutes of Health. TCGA began in 2006 as a pilot focused on three projects, glioblastoma multiforme (GBM), serous cystadenocarcinoma of the ovary, and lung squamous carcinoma, and has recently expanded to produce comprehensive genomic data sets for at least ten other cancers in the next two years. Given TCGA’s contributions in launching the ICGC and cooperation to ensure that its policies (posted at http://cancergenome.nih.gov) are coordinated with those of the ICGC, TCGA’s participation in the ICGC is considered to be equivalent to that of a full member. TCGA, however, is not able to join the ICGC formally at this time, because of technical and legal issues in the US related to the mechanisms of the distribution of controlled-access data, although such data are directly available to investigators at http://cancergenome.nih.gov/dataportal. The National Institutes of Health (NIH) policies relating to distribution of controlled-access data sets are being reviewed with the intent of enabling researchers to integrate and analyse across databases, for example, using the franchise model adopted by the ICGC. Meanwhile, TCGA is ensuring that projects are coordinated and data sets are compatible with those of the consortium.

ICGC in the next decade

A large proportion of common cancers affecting patients around the world have been or will soon be selected for comprehensive cancer genome studies. Further efforts will be needed to leverage support and expertise to tackle the remaining tumour types, including rare cancers. The challenges of the ICGC are daunting owing to the scope of the initiative, the complexity that is inherent to the heterogeneity of cancer, and the limitations of current technologies to provide accurate long-range assemblies of highly rearranged chromosomes found in tumour cells. These challenges underscore the importance of continued international coordination and further engagement of the scientific community in the next decade.

Moving towards clinical applications

ICGC catalogues, which are expected to grow exponentially, will have immediate relevance in the cancer research community. Early insight into the biology of somatic mutations will come from functional studies in cell-based and animal models of tumours. Mutation screens in retrospective tumour banks linked to registries or clinical trials having significant clinical data will shed light on the potential clinical utility of somatic mutations as biomarkers for prognosis or drug response. Germline variants identified by ICGC projects may allow the discovery of genes predisposing to familial malignancies, such as PALB2 and pancreatic cancer12,34. High-throughput screens of RNA interference or small molecule libraries, and the adaptation of existing model systems, will have a major role in refining potential therapeutic candidates for further study35.

Translating these discoveries into clinical practice will require more sophisticated clinical trials that take into account the increases in phenotypic subdivisions, further coordination to identify subjects having tumours with similar profiles, and increased use of biomarkers, genomic analyses, informatics and other technologies in the clinical development of new therapeutics. Given the tremendous potential for relatively low-cost genomic sequencing to reveal clinically useful information, we anticipate that in the not so distant future, partial or full cancer genomes will routinely be sequenced as part of the clinical evaluation of cancer patients and as part of their continuing clinical management. The successful and appropriate translation of cancer genome research into clinical practice will raise important social and ethical questions. It will be essential to combine the expertise of oncologists, biostatisticians, pathologists, geneticists, policy-makers and members of the biopharmaceutical industry to meet this challenge by developing new policies and clinical models that enable rapid translation of many new biomarkers and cancer targets into new clinical tests and therapeutic interventions that will benefit cancer patients.