A comprehensive catalogue of somatic mutations from a human cancer genome

Pleasance, Erin D.; Cheetham, R. Keira; Stephens, Philip J.; McBride, David J.; Humphray, Sean J.; Greenman, Chris D.; Varela, Ignacio; Lin, Meng-Lay; Ordóñez, Gonzalo R.; Bignell, Graham R.; Ye, Kai; Alipaz, Julie; Bauer, Markus J.; Beare, David; Butler, Adam; Carter, Richard J.; Chen, Lina; Cox, Anthony J.; Edkins, Sarah; Kokko-Gonzales, Paula I.; Gormley, Niall A.; Grocock, Russell J.; Haudenschild, Christian D.; Hims, Matthew M.; James, Terena; Jia, Mingming; Kingsbury, Zoya; Leroy, Catherine; Marshall, John; Menzies, Andrew; Mudie, Laura J.; Ning, Zemin; Royce, Tom; Schulz-Trieglaff, Ole B.; Spiridou, Anastassia; Stebbings, Lucy A.; Szajkowski, Lukasz; Teague, Jon; Williamson, David; Chin, Lynda; Ross, Mark T.; Campbell, Peter J.; Bentley, David R.; Futreal, P. Andrew; Stratton, Michael R.

doi:10.1038/nature08658

Download PDF

Article
Open access
Published: 16 December 2009

A comprehensive catalogue of somatic mutations from a human cancer genome

Erin D. Pleasance¹^na1,
R. Keira Cheetham²^na1,
Philip J. Stephens¹,
David J. McBride¹,
Sean J. Humphray²,
Chris D. Greenman¹,
Ignacio Varela¹,
Meng-Lay Lin¹,
Gonzalo R. Ordóñez¹,
Graham R. Bignell¹,
Kai Ye³,
Julie Alipaz⁴,
Markus J. Bauer²,
David Beare¹,
Adam Butler¹,
Richard J. Carter²,
Lina Chen¹,
Anthony J. Cox²,
Sarah Edkins¹,
Paula I. Kokko-Gonzales²,
Niall A. Gormley²,
Russell J. Grocock²,
Christian D. Haudenschild⁵,
Matthew M. Hims²,
Terena James²,
Mingming Jia¹,
Zoya Kingsbury²,
Catherine Leroy¹,
John Marshall¹,
Andrew Menzies¹,
Laura J. Mudie¹,
Zemin Ning¹,
Tom Royce⁴,
Ole B. Schulz-Trieglaff²,
Anastassia Spiridou²,
Lucy A. Stebbings¹,
Lukasz Szajkowski²,
Jon Teague¹,
David Williamson⁵,
Lynda Chin⁶,
Mark T. Ross²,
Peter J. Campbell¹,
David R. Bentley²,
P. Andrew Futreal¹ &
…
Michael R. Stratton^1,7

Nature volume 463, pages 191–196 (2010)Cite this article

34k Accesses
1217 Citations
101 Altmetric
Metrics details

This article has been updated

Abstract

All cancers carry somatic mutations. A subset of these somatic alterations, termed driver mutations, confer selective growth advantage and are implicated in cancer development, whereas the remainder are passengers. Here we have sequenced the genomes of a malignant melanoma and a lymphoblastoid cell line from the same person, providing the first comprehensive catalogue of somatic mutations from an individual cancer. The catalogue provides remarkable insights into the forces that have shaped this cancer genome. The dominant mutational signature reflects DNA damage due to ultraviolet light exposure, a known risk factor for malignant melanoma, whereas the uneven distribution of mutations across the genome, with a lower prevalence in gene footprints, indicates that DNA repair has been preferentially deployed towards transcribed regions. The results illustrate the power of a cancer genome sequence to reveal traces of the DNA damage, repair, mutation and selection processes that were operative years before the cancer became symptomatic.

Inferring gene regulatory networks from single-cell multiome data using atlas-scale external data

Article Open access 12 April 2024

Qiuyue Yuan & Zhana Duren

A single-cell atlas enables mapping of homeostatic cellular shifts in the adult human breast

Article Open access 28 March 2024

Austin D. Reed, Sara Pensa, … Walid T. Khaled

Genome-wide association studies

Article 26 August 2021

Emil Uffelmann, Qin Qin Huang, … Danielle Posthuma

Main

The genomes of all cancer cells carry somatic mutations¹. These may include base substitutions, small insertions and deletions (indels), rearrangements and copy number alterations together with epigenetic changes. Some of these somatic alterations, known as driver mutations, confer selective clonal growth advantage and are causally implicated in oncogenesis. By definition, these are found in cancer genes. The remainder are passengers which do not contribute to cancer development. However, passenger mutations bear the imprints of the mutational mechanisms that have generated them, unsullied by processes of selection, and thus provide insights into the aetiology and pathogenesis of cancer.

Over the last quarter of a century several strategies have been used to detect the various classes of somatic mutation in cancer genomes¹. As a result, approximately 400 cancer genes have been identified^1,2 (http://www.sanger.ac.uk/genetics/CGP/Census/) and somatic mutations from thousands of tumours have provided insights into the mutational processes operative in human cancer^1,3.

With the advent of the human genome sequence a new strategy was proposed^1,4. Systematic sequencing would identify all somatic mutations of all classes in individual cancer genomes, yielding complete catalogues of somatic mutation. Technological limitations initially constrained this to polymerase chain reaction (PCR)-based re-sequencing of the coding exons of protein-coding genes in order to find base substitutions and small indels^5,6,7,8,9,10. Recently, however, several novel technologies have been developed^11,12. These allow sequencing of randomly generated DNA fragments from cancer genomes and thus detect rearrangements and copy number changes as well as base substitutions and small indels, providing sufficient coverage to identify most somatic mutations in an individual cancer genome. These technologies have previously been used to reveal missense mutations in the coding sequences of two acute myeloid leukaemia genomes^13,14. Here, we report the first comprehensive catalogue of somatic mutations from a cancer genome.

The catalogue of somatic mutations

COLO-829 is an immortal, publicly available cancer cell line derived, before treatment, from a metastasis of a malignant melanoma in a 43-year-old male¹⁵. No skin primary was identified. Using Illumina GAII genome analysers, we obtained more than 40-fold average haploid genome coverage by aligned sequence from COLO-829 and 32-fold from COLO-829BL, a lymphoblastoid line derived from the same patient (Supplementary Fig. 1). Differences from the reference sequence were called in both cell lines. The variant set obtained from COLO-829BL was then subtracted from that of COLO-829 to establish the catalogue of somatic mutations in COLO-829.

We identified 33,345 somatic base substitutions. A total of 32,325 were single-base and 510 were double-base substitutions (in which two adjacent bases show somatic mutations) (Table 1, Fig. 1 and Supplementary Table 1). Of 48 already known somatic substitutions, 42 were present in the whole-genome catalogue of mutations, a sensitivity of 88%. Of 470 newly found somatic substitutions that were assessed, 454 (97%) were confirmed by conventional sequencing, indicating a 3% false-positive rate.

Table 1 Somatic mutations identified in COLO-829

Full size table

Figure 1: **The catalogue of somatic mutations in COLO-829.**

A total of 680 small deletions and 303 small insertions were predicted, of which 182 were evaluated and 66 (36%) confirmed. Thus the false-positive rate for insertions and deletions was higher than for substitutions. We were unable to estimate sensitivity as only a single small deletion had previously been reported in COLO-829. This was not called in the whole-genome sequence, although it was present in a subset of reads. Most confirmed somatic insertions and deletions were of single bases (Supplementary Table 2).

To detect rearrangements we searched for fragments in which the sequence reads from the two ends mapped discordantly to the genome¹⁶. A total of 51 somatic rearrangements were predicted. We assessed the sensitivity of detection by evaluating the proportion of somatic changes in genomic copy number for which a rearrangement was found. Thirty out of forty-one (73%) copy number changes examined were detected as rearrangements. However, an additional seven (17%) were at centromeres or other regions where alignment is expected to be difficult. Of the 51 rearrangements, 75% were confirmed as somatic by PCR across the breakpoint junction or the presence of a copy number change in the appropriate position. Of the 37 somatic rearrangements confirmed by PCR and mapped to the base-pair level (Table 1, Fig. 1 and Supplementary Table 3), 3 were interchromosomal and 34 were intrachromosomal, including 25 deletions, 6 inversions, 2 duplications and 1 large intrachromosomal event.

All well-supported somatic copy number changes and regions of loss of heterozygosity (LOH) detected using genotyping arrays were found by whole-genome sequencing (Fig. 1). Two regions of high copy number genomic amplification, on chromosomes 3p and 15p, and two homozygous deletions, on chromosomes 10q and 16q, were observed. Somatic mutations of mitochondria in cancer have previously been reported, although their role in oncogenesis is unclear¹⁷. No somatic mutations present in more than 20% of mitochondrial sequences were observed in COLO-829.

Mutations of coding genes and microRNAs

A total of 292 somatic base substitutions were in protein-coding sequences (Table 1 and Supplementary Table 4). Of these, 187 caused amino acid changes (non-synonymous), including 172 that were missense and 15 nonsense, and 7 affected highly conserved bases at splice sites. There were 105 silent (synonymous) substitutions. One somatic substitution was found in the microRNA hsa-mir-518d in the central region of the stem structure. None of the 66 confirmed small insertions and deletions was in coding sequences.

On the assumption that most silent mutations are biologically inert, the ratio of non-synonymous to synonymous substitutions in protein-coding sequences can be used to estimate the extent of selection overall on non-synonymous changes^6,18. The ratio in COLO-829 was 1.78, not significantly different from that expected by chance (P = 0.5). Thus, the substantial majority of mutations do not seem to be subject to positive or negative selection. However, this test will be insensitive to small numbers of selected mutations.

Several individual substitutions highlighted candidate novel cancer genes. For example, two heterozygous missense mutations (het, p.S229L; het, p.D283H) were in SPDEF, which encodes a member of the ETS transcription factor family¹⁹. SPDEF expression is associated with disease progression in prostate, breast and ovarian cancer^20,21. Sequencing SPDEF through 48 additional, untreated metastatic melanomas revealed a further somatic mutation (p.W158*). A missense mutation (het, p.G297E) was identified in the matrix metalloproteinase gene MMP28. Missense mutations in the family of matrix metalloproteinases were recently reported in melanoma, including several in MMP28 (ref. 22). A missense mutation (het, p.N561K) was also identified in UVRAG, which was originally identified in a complementation assay for ultraviolet light sensitivity in xeroderma pigmentosum group C cells, has recently been shown to have an important role in autophagy and has been proposed as a tumour suppressor gene^23,24. To determine the significance of these and other mutations, analysis of additional melanomas will be required.

Of the 37 somatic rearrangements mapped to the base pair, 19 interrupted protein-coding genes. No in-frame fusion genes were predicted, but one gene (MAGI2) had an exonic deletion that creates a predicted in-frame rearranged transcript. PCR with reverse transcription (RT–PCR) across the exon–exon rearrangement boundary showed that the chimaeric transcript is expressed, but the biological significance of this rearrangement remains to be clarified.

Rearrangements occur at a high frequency in cancer over fragile sites. In COLO-829, rearrangements were found in FHIT and WWOX, which overlie the fragile sites FRA3B and FRA16D, respectively. Both rearrangements are contained within individual introns, however, and are not predicted to alter the sequences of the encoded proteins.

There are several modest copy-number reductions and increases that affect large genomic regions and which alter the copy number of many protein-coding genes (Fig. 1). Of particular interest, however, is the relatively restricted region of high (8- to 12-fold) copy number increase on chromosome 3p which contains four complete genes: RARB, TOP2B, NGLY1 and KS (OXSM). A ∼0.5-megabase (Mb) region on chromosome 15 is also amplified to 4–6 copies and contains MKRN3 and NDN. None of these genes has previously been implicated in cancer development through amplification, although another member of the retinoic receptor family, RARA, is consistently rearranged in acute promyelocytic leukaemia²⁵. A 12-kilobase (kb) internal homozygous deletion of the recessive cancer gene PTEN is predicted to cause premature termination and is presumably implicated in the development of COLO-829. However, the significance of a 40-kb homozygous deletion that removes the start ATG codon of CNOT1, a repressor of transcriptional activation by oestrogen receptor²⁶, is unknown.

Of the three previously identified driver mutations in COLO-829, genome-wide sequencing revealed BRAF V600E in addition to the PTEN deletion described above. A 2-bp deletion in CDKN2A was not detected, but was found in sequence reads when a targeted search was made.

A mutational signature of ultraviolet exposure

Most somatic base substitutions in COLO-829 were C>T /G>A transitions (Fig. 2a). Of the 510 dinucleotide substitutions, 360 were CC>TT/GG>AA changes. This mutational spectrum is reminiscent of that previously associated with ultraviolet light exposure, a known environmental risk factor in the development of malignant melanoma^27,28,29,30.

Figure 2: **Patterns of somatic substitution.**

DNA damage due to ultraviolet light leads to the formation of covalent links between two adjacent pyrimidines³¹. Consequently, C>T mutations due to ultraviolet light usually occur at dipyrimidine sequences. Therefore, to evaluate further the role of ultraviolet light in the pathogenesis of somatic mutations in COLO-829, we examined the sequence context of C>T substitutions (Fig. 2b). A total of 92% of C>T mutations occurred at the 3′ base of a pyrimidine dinucleotide compared with 53% expected by chance (P < 0.0001). CC>TT mutations exhibited a similar pattern (82% were 3′ to a pyrimidine, P < 0.0001). The frequency of C>T and CC>TT mutations due to ultraviolet light exposure is also known to be higher at CpG dinucleotides²⁷. In COLO-829, both C>T substitutions (7.7%, P < 0.0001) and CC>TT double substitutions (10.0%, P = 0.014) showed elevated frequencies at CpG dinucleotides compared to that expected by chance (4.4%). Therefore, the mutation spectrum and sequence context indicate that most C>T/G>A somatic substitutions in COLO-829 are attributable to ultraviolet-light-induced DNA damage.

Expression-directed DNA repair

Ultraviolet-light-induced DNA damage is predominantly repaired by nucleotide excision repair (NER)^32,33. NER is a nonspecific repair process activated by sensing of DNA distortion caused by the DNA modification induced by a mutagen. Over the whole genome, DNA damage is sensed by a specific protein complex and NER is then implemented through excision of an oligonucleotide carrying the damage and filling of the gap by replicative polymerases. The main purpose of this form of NER, known as global genome repair, may be prevention of somatic mutations that would otherwise be transmitted to daughter cells through mitosis. There is, however, an additional form of NER that is selectively directed at the transcribed strand of genes^34,35. This transcription-coupled repair is sensed by the stalling of RNA polymerase II when it encounters a bulky DNA adduct. The purpose of transcription-coupled repair may primarily be to reduce interference with transcription caused by DNA damage.

A consequence of transcription-coupled repair is that DNA damage on the transcribed strand is repaired more efficiently than damage on the non-transcribed strand. Thus, fewer mutations accumulate on the transcribed strand. We investigated whether transcription-coupled repair had been operative in COLO-829 by comparing the number of C>T mutations found on the transcribed strands of all 21,417 protein-coding genes to the number of C>T mutations on the non-transcribed strands. There were 2,773 C>T changes on the transcribed strands and 4,058 on the non-transcribed strands (P < 0.0001) (Fig. 2c), with a similar pattern for CC>TT double substitutions (48 on transcribed strands and 75 on non-transcribed strands). The results are therefore consistent with the past operation of transcription-coupled repair on ultraviolet-light-induced DNA damage in COLO-829.

The involvement of transcription-coupled repair predicts an overall lower prevalence of C>T/G>A substitutions over genic than intergenic DNA. Indeed, only 10,004 (30%) somatic C>T changes were found over the combined genomic footprints of transcribed genes compared with 13,164 (40%) expected if the mutations had been randomly distributed (Table 1, P < 0.0001). However, the effect of transcription-coupled repair accounts only for one-third of the deficit of mutations over protein-coding gene footprints. This suggests the existence of an additional class of NER which is preferentially deployed to both transcribed and non-transcribed strands of genes compared to intergenic DNA. A form of NER with these features has been previously proposed but is not extensively characterized³². We also observed a reduction in mutation prevalence in exons (8.33 per Mb) compared to introns (9.93 per Mb, P = 0.0001). Preferential targeting of NER to exons has not been reported and it is conceivable that this effect may, in part, be attributable to negative selection on coding sequence mutations.

To investigate further the relationship between transcription and NER we examined the correlation between mutation prevalence and gene expression (Fig. 2d). Genes expressed at a high level in COLO-829 showed a lower prevalence of somatic mutations compared to genes expressed at a low level, on both the transcribed and non-transcribed strands. The effect of transcript levels on mutation prevalence was most pronounced for C>T/G>A mutations (P < 0.0001) but was also observed in other mutational classes. A trend towards a higher prevalence of somatic substitutions at the 3′ compared to the 5′ ends of genes was also observed (Fig. 2e). This may be due to aborted transcription, such that 3′ ends are overall less transcribed than 5′ ends, with the consequence that expression-related repair processes are deployed less at 3′ ends and hence the mutation prevalence is higher.

Thus, the catalogue of somatic mutations in COLO-829 has exposed multiple levels of selective application of DNA repair: preferential targeting to transcribed compared to untranscribed genomic regions; to exons compared to introns; to transcribed DNA strands compared to non-transcribed strands; and to the 5′ compared to the 3′ ends of genes.

Evidence for other mutational processes

C>A/G>T transversions constitute the second commonest class of substitution in COLO-829 and exhibit a distinctive sequence context (Fig. 2b). C>A/G>T mutations also show evidence of transcription-coupled repair (P = 0.002) (Fig. 2c, d). However, the bias is against G>T mutations on the transcribed strand, suggesting that the original DNA damage was predominantly on guanine (Fig. 2c). There is substantial evidence that mutagens such as reactive oxygen species predominantly cause damage to guanine which ultimately results in G>T changes. Indeed, results from experimental systems suggest that the sequence context observed in COLO-829 for C>A/G>T transversions is optimal for this class of mutagen³⁶. Thus, in addition to the direct effects of ultraviolet light, the somatic mutational catalogue of COLO-829 may harbour traces of subsidiary mechanisms of DNA damage.

Timing of somatic mutations

The catalogue of somatic mutations is a cumulative record of the mutational events that have occurred during the lineage of cell divisions starting from the fertilized egg and ending in the cancer cell. By combining information on chromosome copy number change with base substitutions the relative order of some mutations on this mitotic lineage can be established.

Several genomic regions in COLO-829 show evidence of loss of one parental chromosome, leading to LOH, followed by re-duplication of the remaining copy. In these regions, mutations which occurred before the re-duplication event will be homozygous, whereas those arising after re-duplication will be heterozygous. In most such regions, a small fraction of mutations are heterozygous, indicating relatively late re-duplication (Fig. 1). However, in a region of LOH on chromosome 1q, there are more heterozygote substitutions than homozygote, suggesting earlier re-duplication (Fig. 1). In this region we were therefore able to compare somatic substitutions that occurred relatively early in the evolution of the cancer (homozygous mutations) with those that occurred later (heterozygous mutations), and found differences in their mutational spectra. C>T mutations account for a higher proportion of early (82%) compared to late mutations (53%, P < 0.0001). By contrast, C>A/G>T changes account for a higher proportion of late (19%) compared to early (2%) mutations. The results suggest that exposure to ultraviolet light was extinguished after the 1q re-duplication event, perhaps when the melanoma metastasized, resulting in a reduced rate of C>T/G>A mutation. Moreover, the increased proportion of C>A/G>T in the later phase suggests that the process underlying these mutations is, at least in part, unrelated to ultraviolet light exposure.

Discussion

We have generated the first comprehensive catalogue of somatic mutations from a human cancer genome. The catalogue includes the overwhelming majority of mutations present in COLO-829, although there remains a small fraction of base substitutions and rearrangements that have eluded detection and a small number of false positives. For insertions and deletions the sensitivity is lower and the false-positive rate is higher, indicating the need for additional computational approaches to their discovery. Epigenetic changes have not been included as the technologies for their exhaustive detection are still under development.

The catalogue of mutations in COLO-829 carries the imprint of past ultraviolet-light-induced DNA damage together with evidence for auxiliary, independent mechanisms of damage. The catalogue also bears traces of the DNA repair processes that have been operative, including transcription-coupled NER and other, less well characterized patterns of NER deployment. Buried within it are most of the driver mutations that have conferred selective growth advantage on this melanoma. The total number of drivers for this, or any other, cancer is currently unknown. Some may be among the 187 non-synonymous substitutions in protein-coding sequences. However, the possibility of drivers in non-coding RNAs, regulatory or currently cryptic functional regions of the genome can now be explored.

In future, the generation of thousands of comprehensive, high-quality catalogues of somatic mutation will provide powerful insights into the processes of DNA damage, mutation, repair and selection that underlie the evolution of all human cancers. These will form the basis of our understanding of cancer causation and development, providing the foundation for prevention and treatment.

Methods Summary

We sequenced 75 bases from both ends of 200-bp and 400-bp libraries constructed from the genomes of COLO-829 and COLO-829BL using Illumina GAII Genome Analysers. To study structural variation, these data were supplemented with paired 50 base reads from 2-kb, 3-kb and 4-kb mate-pair libraries¹¹. Sequences were aligned to the reference human genome (NCBI36) using ELAND¹¹, MAQ³⁷ and BWA³⁸. Differences from the reference sequence were called in both cell lines. The variant set obtained from COLO-829BL was then subtracted from that of COLO-829 to establish the catalogue of somatic mutations in COLO-829.

Online Methods

Sequencing

Construction of short (200/400 bp) and mate-pair (2/3/4 kb) paired end libraries, flowcell preparation and cluster generation was as previously described¹¹. Paired end sequencing was performed on Illumina GAIIx genome analysers as described in the Illumina Genome Analyser operating manual. Primary data analysis including image analysis, base-calling and alignment was carried out with the Illumina pipeline.

Substitution detection

The tumour and normal genomes were built and SNP-called separately. Short insert 2*75 bp paired end reads were aligned to NCBI36 using ELAND¹¹ (v.1.1.1.3;). SNPs were called by CASAVA¹¹ (v1.2) but with an additional parameter: the 75 base reads were divided into equal thirds (bins); we then only called substitutions where the non-reference base was observed at least once either in all three bins, or at least twice in the first bin and once in the second bin. This parameter removes most false-positive SNPs that are due to insertions and deletions (indels).

Somatic substitutions were identified as alleles that were called in the tumour genome but not in the germ line. A depth of at least 10× was required in the germ line. In order to allow for any under-called positions in the germ line, no observations of that allele were permitted in the germ line, although one call was permitted if the depth was ≥30×. Substitutions corresponding to known SNP positions (dbSNP 129) were excluded. Substitutions were annotated using Ensembl version 52.

Insertion and deletion detection

Indels in the tumour and normal genomes were called using Pindel³⁹, BWA³⁸ and GROUPER (A. J. Cox et al., manuscript in preparation) on the NCBI36 genome build. Pindel identifies singleton reads (aligned reads whose mate does not align), splits the unaligned read and attempts to align it in two portions. BWA aligns reads allowing gaps, and then samtools (http://samtools.sourceforge.net/) can be used to call indels from the alignments. GROUPER identifies clusters of singleton reads (minimum two), performs a local assembly of unmapped reads and then maps the contig back to the same region to identify the indel. Somatic indels were identified by subtracting all indels called using any of the three methods in the normal genome from indels called in the tumour (minimum of three reads) with Pindel. A minimum of 10× normal coverage was also required. Indels were annotated based on Ensembl version 52.

Structural variant detection

Abnormal read pairs that mapped to the genome at an unexpected distance or orientation were identified, grouped and filtered as previously described¹⁶. Structural variants were called from the long insert data using MAQ alignments, requiring at least ten independent read pairs in the tumour and no read pairs representing the rearrangement in the normal. Structural variants were called from the short insert data using both MAQ and ELAND alignments. When a structural variant was identified, all reads that were predicted to overlap the breakpoint region based on the location of their paired read and insert size were used to attempt an assembly across the breakpoint with Velvet⁴⁰. Assembled contigs were aligned back to the reference genome to annotate the precise breakpoint.

Confirmation by capillary sequencing

Substitutions and insertions and deletions were confirmed by capillary sequencing across the region of the mutation. Structural variants were confirmed by PCR across the breakpoint and capillary sequence. All confirmations were done in both the tumour and normal to determine if the variants were somatic or germ line.

Copy number determination

Reads were counted in windows across the genome, corrected for genome uniqueness as previously described¹⁶. A GC correction was applied to each bin count based on a linear correction for its GC content. An HMM was used to segment the data, taking into account known breakpoint locations, and to estimate the integer copy number for each segment (Supplementary Table 5).

Loss of heterozygosity detection

The zygosity of each of ∼924,000 known SNP positions (corresponding to SNPs used on the Affymetrix SNP 6.0 array) was determined in both the normal and the tumour. Regions of LOH (where SNPs were heterozygous in the normal but homozygous in the tumour) were identified using an HMM (Supplementary Table 6).

Substitution analysis

For mutation context, the bases ±10 bp from the mutation were extracted and the number of each base counted. Context of equivalent changes (for example, C>T and G>A) were combined. The background context was determined from 100,000 random positions of the equivalent base type from chromosome 2. The calculation of mutation frequency in transcribed and untranscribed regions was corrected for sequence gaps and regions of low coverage. Strand bias was calculated based on annotating each mutation as to whether it fell on the transcribed or untranscribed strand in Ensembl 52. Gene expression data were derived from the Affymetrix U133 Plus 2.0 array, run in triplicate, normalized and averaged. Poisson regression was used for the analysis of the effects of gene expression on mutation prevalence, incorporating the number of at-risk bases in each gene footprint as the offset, allowing quadratic terms for the relationship between expression and mutation prevalence, and using a dummy variable for transcribed versus non-transcribed strand mutations. Duplication timing was determined by calculating the frequency of heterozygous and homozygous mutations per Mb throughout LOH regions of at least 5 Mb in size, excluding gaps in the reference sequence. The image of the entire COLO-829 genome was produced using Circos⁴¹.

Accession codes

Primary accessions

EMBL/GenBank/DDBJ

EGAS00000000052

Data deposits

Genome sequence data have been deposited at the European Genome-Phenome Archive (http://www.ebi.ac.uk/ega/), which is hosted by the EBI, under accession number EGAS00000000052.

Change history

14 January 2010
Data accession information was added on 14 January 2010.

References

Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009)
Article ADS CAS Google Scholar
Futreal, P. A. et al. A census of human cancer genes. Nature Rev. Cancer 4, 177–183 (2004)
Article CAS Google Scholar
Pfeifer, G. P. & Besaratinia, A. Mutational spectra of human cancer. Hum. Genet. 125, 493–506 (2009)
Article CAS Google Scholar
Futreal, P. A. et al. Cancer and genomics. Nature 409, 850–852 (2001)
Article ADS CAS Google Scholar
The Cancer Genome Atlas Research Network Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008)
Article Google Scholar
Greenman, C. et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153–158 (2007)
Article ADS CAS Google Scholar
Wood, L. D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108–1113 (2007)
Article ADS CAS Google Scholar
Ding, L. et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455, 1069–1075 (2008)
Article ADS CAS Google Scholar
Jones, S. et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 321, 1801–1806 (2008)
Article ADS CAS Google Scholar
Parsons, D. W. et al. An integrated genomic analysis of human glioblastoma multiforme. Science 321, 1807–1812 (2008)
Article ADS CAS Google Scholar
Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008)
Article ADS CAS Google Scholar
Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005)
Article ADS CAS Google Scholar
Ley, T. J. et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66–72 (2008)
Article ADS CAS Google Scholar
Mardis, E. R. et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. N. Engl. J. Med. 361, 1058–1066 (2009)
Article CAS Google Scholar
Morse, H. G. & Moore, G. E. Cytogenetic homogeneity in eight independent sites in a case of malignant melanoma. Cancer Genet. Cytogenet. 69, 108–112 (1993)
Article CAS Google Scholar
Campbell, P. J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nature Genet. 40, 722–729 (2008)
Article CAS Google Scholar
Chatterjee, A., Mambo, E. & Sidransky, D. Mitochondrial DNA mutations in human cancer. Oncogene 25, 4663–4674 (2006)
Article CAS Google Scholar
Greenman, C., Wooster, R., Futreal, P. A., Stratton, M. R. & Easton, D. F. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 173, 2187–2198 (2006)
Article CAS Google Scholar
Oettgen, P. et al. PDEF, a novel prostate epithelium-specific ets transcription factor, interacts with the androgen receptor and activates prostate-specific antigen gene expression. J. Biol. Chem. 275, 1216–1225 (2000)
Article CAS Google Scholar
Sood, A. K. et al. Expression characteristics of prostate-derived Ets factor support a role in breast and prostate cancer progression. Hum. Pathol. 38, 1628–1638 (2007)
Article CAS Google Scholar
Ghadersohi, A. et al. Prostate-derived Ets transcription factor as a favorable prognostic marker in ovarian cancer patients. Int. J. Cancer 123, 1376–1384 (2008)
Article CAS Google Scholar
Palavalli, L. H. et al. Analysis of the matrix metalloproteinase family reveals that MMP8 is often mutated in melanoma. Nature Genet. 41, 518–520 (2009)
Article CAS Google Scholar
Teitz, T. et al. Isolation by polymerase chain reaction of a cDNA whose product partially complements the ultraviolet sensitivity of xeroderma pigmentosum group C cells. Gene 87, 295–298 (1990)
Article CAS Google Scholar
Liang, C. et al. Autophagic and tumour suppressor activity of a novel Beclin1-binding protein UVRAG. Nature Cell Biol. 8, 688–698 (2006)
Article CAS Google Scholar
Scaglioni, P. P. & Pandolfi, P. P. The theory of APL revisited. Curr. Top. Microbiol. Immunol. 313, 85–100 (2007)
CAS PubMed Google Scholar
Winkler, G. S., Mulder, K. W., Bardwell, V. J., Kalkhoven, E. & Timmers, H. T. Human Ccr4-Not complex is a ligand-dependent repressor of nuclear receptor-mediated transcription. EMBO J. 25, 3089–3099 (2006)
Article CAS Google Scholar
Pfeifer, G. P., You, Y. H. & Besaratinia, A. Mutations induced by ultraviolet light. Mutat. Res. 571, 19–31 (2005)
Article CAS Google Scholar
Armstrong, B. K. & Kricker, A. The epidemiology of UV induced skin cancer. J. Photochem. Photobiol. B 63, 8–18 (2001)
Article CAS Google Scholar
Leiter, U. & Garbe, C. Epidemiology of melanoma and nonmelanoma skin cancer—the role of sunlight. Adv. Exp. Med. Biol. 624, 89–103 (2008)
Article Google Scholar
Giglia-Mari, G. & Sarasin, A. TP53 mutations in human skin cancers. Hum. Mutat. 21, 217–228 (2003)
Article CAS Google Scholar
Daya-Grosjean, L. & Sarasin, A. The role of UV induced lesions in skin carcinogenesis: an overview of oncogene and tumor suppressor gene modifications in xeroderma pigmentosum skin tumors. Mutat. Res. 571, 43–56 (2005)
Article CAS Google Scholar
Nouspikel, T. DNA repair in mammalian cells: Nucleotide excision repair: variations on versatility. Cell. Mol. Life Sci. 66, 994–1009 (2009)
Article CAS Google Scholar
Shuck, S. C., Short, E. A. & Turchi, J. J. Eukaryotic nucleotide excision repair: from understanding mechanisms to influencing biology. Cell Res. 18, 64–72 (2008)
Article CAS Google Scholar
Hanawalt, P. C. & Spivak, G. Transcription-coupled DNA repair: two decades of progress and surprises. Nature Rev. Mol. Cell Biol. 9, 958–970 (2008)
Article CAS Google Scholar
Fousteri, M. & Mullenders, L. H. Transcription-coupled nucleotide excision repair in mammalian cells: molecular mechanisms and biological effects. Cell Res. 18, 73–84 (2008)
Article CAS Google Scholar
Kawanishi, S., Hiraku, Y. & Oikawa, S. Mechanism of guanine-specific DNA damage by oxidative stress and its role in carcinogenesis and aging. Mutat. Res. 488, 65–76 (2001)
Article CAS Google Scholar
Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851–1858 (2008)
Article CAS Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
Article CAS Google Scholar
Ye, K. et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009)
Article CAS Google Scholar
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008)
Article CAS Google Scholar
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009)
Article CAS Google Scholar

Download references

Acknowledgements

The authors acknowledge the support of C. Henry, S. Kahn, D. Evers, G. Smith, K. Hall and the technical and administrative support staff at Illumina. I.V. is supported by the Human Frontiers Science Program and P.J.C. by the Kay Kendall Leukaemia Fund. We would like to acknowledge the Wellcome Trust for support under grant reference 077012/Z/05/Z.

Author Contributions M.R.S., E.D.P., P.A.F., P.J.C. and D.R.B. designed the experiment. S.E., S.J.H., J.A., P.I.K.-G., N.A.G., C.D.H., M.M.H., T.J., Z.K. and D.W. carried out laboratory analysis. L. Chin provided the clinical sample. E.D.P., R.K.C., P.J.S., D.J.M., S.J.H., C.D.G., I.V., M.-L.L., G.R.O., G.R.B., K.Y., J.A., M.J.B., D.B., A.B., R.J.C., L. Chen, A.J.C., P.I.K.-G., N.A.G., R.J.G., C.D.H., M.M.H., T.J., M.J., Z.K., C.L., J.M., A.M., L.J.M., Z.N., T.R., O.B.S.-T., A.S., L.A.S., L.S., J.T., D.W., M.T.R., P.J.C., D.R.B., P.A.F. and M.R.S. performed data analysis, informatics and statistics. M.R.S., E.D.P., D.R.B., M.T.R., R.K.C., P.A.F. and P.J.C. wrote the manuscript.

Author information

Erin D. Pleasance and R. Keira Cheetham: These authors contributed equally to this work.

Authors and Affiliations

Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK
Erin D. Pleasance, Philip J. Stephens, David J. McBride, Chris D. Greenman, Ignacio Varela, Meng-Lay Lin, Gonzalo R. Ordóñez, Graham R. Bignell, David Beare, Adam Butler, Lina Chen, Sarah Edkins, Mingming Jia, Catherine Leroy, John Marshall, Andrew Menzies, Laura J. Mudie, Zemin Ning, Lucy A. Stebbings, Jon Teague, Peter J. Campbell, P. Andrew Futreal & Michael R. Stratton
Illumina Cambridge Ltd, Chesterford Research Park, Little Chesterford, Essex CB10 1XL, UK ,
R. Keira Cheetham, Sean J. Humphray, Markus J. Bauer, Richard J. Carter, Anthony J. Cox, Paula I. Kokko-Gonzales, Niall A. Gormley, Russell J. Grocock, Matthew M. Hims, Terena James, Zoya Kingsbury, Ole B. Schulz-Trieglaff, Anastassia Spiridou, Lukasz Szajkowski, Mark T. Ross & David R. Bentley
Departments of Molecular Epidemiology, Medical Statistics and Bioinformatics, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, the Netherlands,
Kai Ye
Illumina Inc., Corporate Headquarters, 9865 Towne Centre Drive, San Diego, California 92121, USA ,
Julie Alipaz & Tom Royce
Illumina Hayward, 25861 Industrial Bvld, Hayward, California 94545, USA ,
Christian D. Haudenschild & David Williamson
Dana-Farber Cancer Institute, 44 Binney Street, Boston, Massachusetts 02115, USA ,
Lynda Chin
Institute of Cancer Research, Sutton, Surrey SM2 5NG, UK ,
Michael R. Stratton

Authors

Erin D. Pleasance
View author publications
You can also search for this author in PubMed Google Scholar
R. Keira Cheetham
View author publications
You can also search for this author in PubMed Google Scholar
Philip J. Stephens
View author publications
You can also search for this author in PubMed Google Scholar
David J. McBride
View author publications
You can also search for this author in PubMed Google Scholar
Sean J. Humphray
View author publications
You can also search for this author in PubMed Google Scholar
Chris D. Greenman
View author publications
You can also search for this author in PubMed Google Scholar
Ignacio Varela
View author publications
You can also search for this author in PubMed Google Scholar
Meng-Lay Lin
View author publications
You can also search for this author in PubMed Google Scholar
Gonzalo R. Ordóñez
View author publications
You can also search for this author in PubMed Google Scholar
Graham R. Bignell
View author publications
You can also search for this author in PubMed Google Scholar
Kai Ye
View author publications
You can also search for this author in PubMed Google Scholar
Julie Alipaz
View author publications
You can also search for this author in PubMed Google Scholar
Markus J. Bauer
View author publications
You can also search for this author in PubMed Google Scholar
David Beare
View author publications
You can also search for this author in PubMed Google Scholar
Adam Butler
View author publications
You can also search for this author in PubMed Google Scholar
Richard J. Carter
View author publications
You can also search for this author in PubMed Google Scholar
Lina Chen
View author publications
You can also search for this author in PubMed Google Scholar
Anthony J. Cox
View author publications
You can also search for this author in PubMed Google Scholar
Sarah Edkins
View author publications
You can also search for this author in PubMed Google Scholar
Paula I. Kokko-Gonzales
View author publications
You can also search for this author in PubMed Google Scholar
Niall A. Gormley
View author publications
You can also search for this author in PubMed Google Scholar
Russell J. Grocock
View author publications
You can also search for this author in PubMed Google Scholar
Christian D. Haudenschild
View author publications
You can also search for this author in PubMed Google Scholar
Matthew M. Hims
View author publications
You can also search for this author in PubMed Google Scholar
Terena James
View author publications
You can also search for this author in PubMed Google Scholar
Mingming Jia
View author publications
You can also search for this author in PubMed Google Scholar
Zoya Kingsbury
View author publications
You can also search for this author in PubMed Google Scholar
Catherine Leroy
View author publications
You can also search for this author in PubMed Google Scholar
John Marshall
View author publications
You can also search for this author in PubMed Google Scholar
Andrew Menzies
View author publications
You can also search for this author in PubMed Google Scholar
Laura J. Mudie
View author publications
You can also search for this author in PubMed Google Scholar
Zemin Ning
View author publications
You can also search for this author in PubMed Google Scholar
Tom Royce
View author publications
You can also search for this author in PubMed Google Scholar
Ole B. Schulz-Trieglaff
View author publications
You can also search for this author in PubMed Google Scholar
Anastassia Spiridou
View author publications
You can also search for this author in PubMed Google Scholar
Lucy A. Stebbings
View author publications
You can also search for this author in PubMed Google Scholar
Lukasz Szajkowski
View author publications
You can also search for this author in PubMed Google Scholar
Jon Teague
View author publications
You can also search for this author in PubMed Google Scholar
David Williamson
View author publications
You can also search for this author in PubMed Google Scholar
Lynda Chin
View author publications
You can also search for this author in PubMed Google Scholar
Mark T. Ross
View author publications
You can also search for this author in PubMed Google Scholar
Peter J. Campbell
View author publications
You can also search for this author in PubMed Google Scholar
David R. Bentley
View author publications
You can also search for this author in PubMed Google Scholar
P. Andrew Futreal
View author publications
You can also search for this author in PubMed Google Scholar
Michael R. Stratton
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to P. Andrew Futreal or Michael R. Stratton.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Figures

This file contains Supplementary Figure 1 with Legend. (PDF 118 kb)

Supplementary Table 1

All somatic substitutions. Dinucleotide substitutions are shown on adjacent lines. Validation status ‘N/A’ indicates the variant was not tested for confirmation. Validation status ‘REAL, STATUS UNKNOWN’ indicates the variant was present, but sequencing of the normal form this patient failed and therefore the confirmation does not indicate whether the variant was somatic or germline. COSMIC denotes a previously identified somatic mutation (http://www.sanger.ac.uk/genetics/CGP/cosmic/). POLYMERASE_SLIP refers to a false positive insertion or deletion generated by polymerase slippage at a homopolymer tract. (XLS 3792 kb)

Supplementary Table 2

All predicted insertions & deletions, with coding effects annotated. Validation status is indicated as in Supplementary Table 1 (XLS 171 kb)

Supplementary Table 3

All somatic structural variants confirmed and mapped to the base pair level. (XLS 26 kb)

Supplementary Table 4

All coding effects of somatic substitutions. Validation status is indicated as in Supplementary Table 1. (XLS 86 kb)

Supplementary Table 5

Copy number segments in COLO-829 genome. (XLS 31 kb)

Supplementary Table 6

Regions of loss of heterozygosity in COLO-829 genome. (XLS 21 kb)

PowerPoint slides

PowerPoint slide for Fig. 1

PowerPoint slide for Fig. 2

Rights and permissions

This article is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence (http://creativecommons.org/licenses/by-nc-sa/3.0/), which permits distribution, and reproduction in any medium, provided the original author and source are credited. This licence does not permit commercial exploitation, and derivative works must be licensed under the same or similar licence.

Reprints and permissions

About this article

Cite this article

Pleasance, E., Cheetham, R., Stephens, P. et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 191–196 (2010). https://doi.org/10.1038/nature08658

Download citation

Received: 30 July 2009
Accepted: 04 November 2009
Published: 16 December 2009
Issue Date: 14 January 2010
DOI: https://doi.org/10.1038/nature08658

This article is cited by

Understanding exposomes and its relation with cancer risk in Malaysia based on epidemiological evidence: a narrative review
- Razinah Sharif
- Theng Choon Ooi
Genes and Environment (2024)
Sequence dependencies and mutation rates of localized mutational processes in cancer
- Gustav Alexander Poulsgaard
- Simon Grund Sørensen
- Jakob Skou Pedersen
Genome Medicine (2023)
TAGET: a toolkit for analyzing full-length transcripts from long-read sequencing
- Yuchao Xia
- Zijie Jin
- Ruibin Xi
Nature Communications (2023)
Strand asymmetry influences mismatch resolution during single-strand annealing
- Victoria O. Pokusaeva
- Aránzazu Rosado Diez
- Guillaume J. Filion
Genome Biology (2022)
Duplex sequencing identifies genomic features that determine susceptibility to benzo(a)pyrene-induced in vivo mutations
- Danielle P. M. LeBlanc
- Matthew Meier
- Francesco Marchetti
BMC Genomics (2022)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.