Main

Beyond alterations to single chromosomes, changes in genomic copy number can also occur through whole-genome doubling (WGD) and chromothripsis. WGD is when the entire chromosomal content of a cell is duplicated7 from a diploid to a tetraploid state, whereas chromothripsis is a ‘genomic catastrophe’ that leads to clustered rearrangements associated with oscillating copy number patterns8. These evolutionary events may occur multiple times at different intensities during tumour development and lead to highly complex cancer genomes9.

Previously, we developed a computational framework that enables the separation of somatic mutations into mutational signatures of single base substitutions (SBSs), doublet base substitutions (DBSs), and small insertions or deletions (IDs)10,11. Analyses of mutational signatures have provided unprecedented insights into the exogenous and endogenous processes that mould cancer genomes at a single nucleotide level12. Prior studies have also examined signatures of genomic rearrangements in cancer, and these have revealed insights into cancer-subtype-specific homologous recombination deficiency (HRD) and templated insertions13,14. Moreover, advancement in the bioinformatics integration of single nucleotide mutations, rearrangements and microsatellite instability profiles have improved signal-to-noise ratios to identify cancer processes15. However, rearrangement signatures can only be derived from whole-genome sequencing (WGS) data, which significantly limits their translational usability.

We recently developed a ‘mechanism-agnostic’ approach to summarize allele-specific copy number profiles in whole-genome sequenced sarcomas16, whereby a priori information on the mutational processes active in those cancers was not known, which we term copy number signatures. Other cancer-subtype-specific methods to interrogate copy number patterns that use known hallmarks of genomic instability have been applied to multiple myeloma17, breast cancer18, ovarian cancer19 and prostate cancer20. To our knowledge, there is currently no approach that allows the interrogation of copy number signatures derived from allele-specific profiles across multiple cancer types and across different experimental assays. To address this gap, we developed a new framework to decipher copy number signatures across cancer types (Supplementary Table 1) and multiple experimental platforms.

A framework for copy number signatures

The extent of genomic instability—as measured through the number of copy number segments, the proportion of the genome displaying loss of heterozygosity (LOH) and the status of genome doubling—varied greatly among cancer types in The Cancer Genome Atlas (TCGA) (Fig. 1a, b). Nevertheless, a linear relationship was observed between the number of segments and the proportion of genomic LOH, which varies from cancers with diploid and copy number ‘quiet’ genomes (for example, acute myeloid leukaemia, thymoma and thyroid carcinoma; Fig. 1a and see Supplementary Table 1 for abbreviations of the cancer type) to cancers with highly aberrant copy number profiles (for example, high-grade serous ovarian carcinomas and sarcomas; Extended Data Fig. 1a, b). This linear relationship failed to hold only for adrenocortical carcinoma and chromophobe renal cell carcinoma, both of which demonstrated enrichment of LOH without enrichment of copy number segmentation (Extended Data Fig. 1a–c). In addition, considerable variability of ploidy was observed both between and within cancer types (Fig. 1b and Extended Data Fig. 1d).To distil this copy number heterogeneity and to capture biologically relevant copy number features, we developed a classification framework that encodes the copy number profile of a sample by summarizing the counts of segments into a 48-dimensional vector on the basis of the total copy number (TCN), the heterozygosity status and the segment size (Methods and Extended Data Fig. 1e–l).

Fig. 1: Pan-cancer copy number features of 33 tumour types from TCGA.
figure 1

a, Median number of segments in a copy number (CN) profile (x axis), median proportion of the genome that shows LOH (y axis) and the proportion of samples that have undergone one or more WGD events (size). The line of best fit from a robust linear regression is shown, whereby the colour of points indicates the weight of the tumour type in the regression model. Error bands indicate the 95% confidence interval, n = 33, t = 4.95, P = 2.5e-5. See Supplementary Table 1 for cancer type abbreviations. b, Ploidy characteristics of all samples split by tumour type. Bottom, ploidy (y axis) for each sample in a tumour type (x axis), whereby samples are coloured by their genome doubling status as follows: 0×WGD, non-genome-doubled (green); 1×WGD, genome doubled (purple); and 2×WGD, twice genome-doubled (orange). Top, proportion (Prop.) of samples in each tumour type that are 0, 1 or 2×WGD. Horizontal lines indicate median ploidies. c, Decomposition plots of 21 pan-cancer copy number signatures (CN1–CN21). Heterozygosity (Het) status and total copy number (0–9+) are indicated below each column. Segment sizes are shown on the bottom right. Increasing saturation of colour indicates increasing segment size.

To ensure the generalizability of our framework across platforms, we optimized the copy number calling strategy for each platform, which yielded a strong concordance of summary vectors between WGS, whole-exome sequencing (WES) and SNP6-profiling-derived copy number profiles (Extended Data Fig. 1m–p, Supplementary Table 1 and Methods).

Repertoire of copy number signatures

Copy number matrices (n = 9,873; Supplementary Table 1) were decomposed using our previously established and extensively validated approach for deriving a reference set of signatures10,11 (Methods). This approach enabled the identification of both the shared patterns of copy number across all examined samples and the quantification of the number of segments attributed to each copy number signature in each sample, which we termed ‘signature attribution’.

In this first iteration (Methods), we identified 21 distinct pan-cancer signatures (Fig. 1c and Supplementary Table 2). These signatures accurately reconstructed the copy number profiles of 97% of the examined TCGA samples (q value < 0.05; Methods). The remaining 3% were poorly reconstructed owing to a combination of a low number of segments and/or a high diversity of copy number states in the copy number profile or few operative signatures identified, and are unrelated to purity estimates (Extended Data Fig. 2a–e). The 21 copy number signatures (CN1–CN21) were carefully inspected and categorized into six groups on the basis of their most prevalent features. CN1 and CN2 are primarily defined by >40 Mb heterozygous segments with TCNs of 2 and 3–4 respectively. CN3 is characterized by heterozygous segments with sizes >1 Mb and TCNs between 5 and 8. CN4–CN8 each have segment sizes between 100 kb and 10 Mb but with different TCN or LOH states. CN9–CN12 each have numerous LOH components with segment sizes <40 Mb. CN13–CN16 have whole-arm-scale or whole-chromosome-scale LOH events (>40 Mb). CN17 consists of LOH segments with TCNs between 2 and 4 as well as heterozygous segments with TCNs between 3 and 8, each with segment sizes 1–40 Mb. CN18–CN21 exhibit complex patterns of copy number alterations that are uncommon but are seen in distinct cancer types. In addition, three signatures (CN22–CN24) indicative of copy number profile oversegmentation were identified (Extended Data Fig. 2f).

We also systematically examined copy number signatures derived from WGS, WES and SNP6 profiles of the same samples. The results from this analysis demonstrated a strong concordance between signatures identified through different platforms (median cosine similarity of >0.8) (Extended Data Figs. 1m and 2g–j, and Supplementary Table 2) and different copy number callers (median cosine similarity of 0.98) (Extended Data Fig. 2k–l and Supplementary Table 2).

Transitional nature of copy number signatures

The catalogue of somatic mutations in a cancer genome is the cumulative result of the mutational processes that have been operative over the lifetime of the cell of origin21. Analyses of SBS and ID mutational signatures have used assumptions and prior evidence that individual mutations are independent and additive12. However, this assumption is violated for large-scale macro-evolutionary events such as WGD22. Moreover, there are inherent challenges in inferring WGD using copy number calling algorithms that affect subclonal tumour reconstruction23. We therefore generated several synergistic lines of evidence to investigate the impact of WGD on copy number signatures. First, we undertook copy number profiling of experimentally ploidy-sorted populations of undifferentiated soft tissue sarcoma (Supplementary Table 3 and Extended Data Fig. 3a–f). Second, each copy number signature was tested for enrichment in non-, once- or twice-genome doubled samples (Extended Data Fig. 3g, h and Supplementary Table 3). Third, in silico simulations of genome doubling on the extracted signatures were performed (Methods, Extended Data Fig. 3i and Supplementary Table 3). Fourth, copy number profiles arising from dynamics of WGD and chromosomal instability (CIN) were simulated (Extended Data Fig. 3j) and re-examined for the previously derived signatures (Extended Data Fig. 3k and Supplementary Table 3).

By combining the preceding set of in silico simulations and wet-laboratory experiments, we confirmed the transitional nature of copy number signatures, with one signature being completely effaced by another after WGD (Extended Data Fig. 3l). In this model, a cancer with a diploid signature (CN1), may undergo WGD, which alters the signature CN1 into signature CN2. Alternatively, a cancer may show a CIN-transforming signature of CN1 into signature CN9. Through a combination of CIN and WGD, signature CN2 may transform into signature CN3. Meanwhile, CN13–CN15 are linked through successive WGD events on the background of early chromosomal losses.

Although WGD has a transitional effect on copy number signatures, we hypothesized that smaller scale events, such as segmental aneuploidy, may reflect additive behaviour similar to mutational signatures. To investigate this, we focused on the ploidy-associated signatures CN1 (diploid) and CN2 (tetraploid), for which an attribution of both signatures together indicates a hyperdiploid or subtetraploid profile (Extended Data Fig. 3m and Supplementary Table 3). We mapped these signatures across the cancer genomes so that only CN1 and CN2 were attributed (CN1 + CN2 = 1) and had a mixed attribution of those signatures (CN1 × CN2 > 0.15). This analysis recapitulated known patterns of aneuploidy in human cancer24, including gains of chromosomes 1q, 7, 8q, 16p, 17q and 20 in more than 50% of TCGA samples (Extended Data Fig. 3n).

The landscape of copy number signatures

Next, we surveyed the distribution of the 21 signatures across different cancer types (Fig. 2 and Supplementary Table 4). The ploidy-associated signatures CN1 and CN2 were found in most samples across all cancer types. Signatures CN4, CN7, CN10, CN18, CN20 and CN21 were derived through specific cancer type extractions and therefore unique to uveal melanoma, breast cancer, lung squamous carcinoma, ovarian carcinoma, liver cancer and paragangliomas, respectively. Signatures CN4–CN8 all showed segments of high TCNs and were seen in tumour types with known prevalent amplicon events25. CN9–CN12 showed differing patterns of hypodiploidy, with segment sizes of LOH < 40 Mb and WGD that was reflective of a type of structural CIN often induced by replication stress26. Signatures CN14 and CN16 were prevalent in adrenocortical carcinoma and chromophobe renal cell carcinoma, which indicates a link with the known patterns of chromosomal-scale LOH (cLOH) seen in these cancers27,28. Signature CN17 was prevalent in tumour types previously described as being HRD and enriched in the tandem duplicator phenotype (TDP)29. Different cancer lineages clustered together on the basis of the prevalence of signatures; namely TDP, WGD, diploid CIN, simple diploidy and cLOH (Fig. 2). This segregation of cancer types and their constituent signatures reflects the genomic heterogeneity imparted through WGD, chromothripsis and aneuploidy in human cancer5,7.

Fig. 2: Distribution of copy number signatures across human cancers.
figure 2

Attributions of the 21 signatures (y axis) split by tumour type (x axis). The size of each dot represents the proportion of samples of each tumour type that shows the signature and the colour reflects the median attribution of the signature in each tumour type. Tumour/signature attributions with less than 5% of samples are not shown. Hierarchical clustering is shown below, sample sizes are shown above. aCN15 was identified from an extraction of high LOH samples (>70% of the genome LOH), and is not found at ≥5% frequency in any tumour type. bCN4 was identified in UVM at <5% frequency. Het mix, mixture of heterozygous segments.

Signatures associated with chromothripsis

Oncogene amplification is associated with aggressive behaviour in cancer25. Reasoning that signatures with high levels of TCN (CN4–CN8) could be associated with genomic amplification, we correlated these signatures with known classes of amplicons25,30. All amplicon signatures were positively associated with one or more amplicon types (Fig. 3a and Extended Data Fig. 4). CN8, which shows very high copy number states and is enriched in nine cancer types (two-sided Mann–Whitney test, q < 0.05), was strongly associated with all four classes of amplicons, although most strongly with extra-chromosomal circular DNA amplicons (ecDNA) and the recently described large amplicon phenotype termed ‘tyfonas’31 (Extended Data Fig. 4a).

Fig. 3: Biological inference of copy number signatures.
figure 3

a, Associations between signatures (y axis) and amplicon structures (x axis), displaying the q value (size) and log2(OR) (colour) from two-sided Fisher’s exact tests of genomic regions unattributed or attributed to each signature against each amplicon type. Only significant (q < 0.05) associations are shown. BFB, breakage–fusion–bridge. b, Enrichment of mapped CN8 in 1-Mb windows of the human genome across 8 cancer types in which ≥40 samples were attributed CN8. Colour indicates the –log2(q value) from a bootstrapping analysis to determine significance. An ideogram of chromosome bands is shown above. c, Single-cell sequencing from a near-genome-wide LOH undifferentiated soft tissue sarcoma. Sorted populations of cells based on ploidy and proliferation (left) were single-cell sequenced and copy number profiled (middle, representative cells). Copy number (y axis) across the genome (x axis) is given for both the major (blue) and minor (orange) allele. Copy number summaries (red) and signatures (blue) recapitulate the pattern seen in the copy number profiles (right). d, Association between mutational status of key HR pathway genes and CN17 attribution from a multivariate two-sided logistic regression model including cancer type as a covariate. NS, not significant (P ≥ 0.05). Squares represent point estimates for the odds ratio (OR). Horizontal lines indicate 95% confidence intervals. n = 4,919 biologically independent tumours. Bi., bi-allelic alteration; Mono., monoallelic alteration; WT, wild type. e, Association between signature attribution and scarHRD score, an orthogonal test for HRD, displaying –log2(q) (y axis) and log2(OR) (x axis) from two-sided Fisher’s exact tests in which scarHRD positivity was based on a threshold of >42. A half dot indicates an infinite –log2(q value) (q = 0). f, Correlation between copy number signature (x axis) attribution and SBS  or ID signature (y axis) exposure across TCGA exomes (left) and whole genomes (right). The strength of correlation is indicated by colour (orange, anticorrelated, blue, correlated), the q value is indicated by point size. Non-significant (q > 0.01) associations are not shown.

Recent evidence shows that genomic amplification can evolve through inter-related processes of chromothripsis, breakage–fusion–bridge and ecDNA formation32. To test this finding, we mapped the copy number signatures with known regions of chromothripsis33 across the cancer genome (Methods), which revealed that CN5–CN8 are enriched in chromothriptic regions (Extended Data Fig. 4b and Supplementary Table 5). Each of these signatures was dominated by small segments, while CN7 and CN8 were both strongly associated with amplified chromothripsis33 (Extended Data Fig. 4c), larger DNA segments and complex chromothriptic events (Extended Data Fig. 4d). Simulations of copy number profiles incorporating processes of chromothripsis, WGD and chromosomal duplication (Extended Data Fig. 4e) demonstrated that CN4–CN8 can be generated through chromothripsis-like events. Moreover, these signatures reflected distinct life histories of tumours, such as chromothripsis before or after WGD (Extended Data Fig. 4d, f and Supplementary Table 5).

Chromothripsis and gene amplification are both independently associated with poor prognosis25,34. Attribution of any of the five amplicon signatures in their respective cancer types showed poor disease-specific survival in a univariate pan-cancer analysis (Extended Data Fig. 5a and Supplementary Table 5). Similarly, multiple amplicon signatures were associated with reduced disease-specific survival in multivariate pan-cancer and cancer subtype analyses, with consistent results from analyses based on Cox-model hazard ratios (Extended Data Fig. 5b, c and Supplementary Table 5) and analyses based on accelerated failure times (Extended Data Fig. 5d, e and Supplementary Table 5). For example, a cancer-type-specific survival analysis revealed that patients with glioblastoma with operative signature CN5 had poor disease-specific survival (172 days reduced median survival; Extended Data Fig. 5f and Supplementary Table 5). To determine the topographical localization of the amplification events, we mapped the most common amplicon signature with the highest amplification level, CN8, across the genome in eight cancer types (n ≥ 40 in each type) that were attributed CN8, and assessed CN8 enrichment in each cancer type through a bootstrapping analysis. This revealed cancer-type-specific enrichment of CN8 in regions harbouring oncogenes that are commonly amplified in their respective cancer types (Fig. 3b and Supplementary Table 5).

Signatures associated with LOH

LOH is an important mechanism that contributes to the inactivation of tumour suppressor genes during cancer development6,33,35. Nine signatures were positively correlated with LOH regions of the genome (Extended Data Fig. 6a) and were recurrently found around known tumour suppressor genes (Extended Data Fig. 6b and Supplementary Table 6). Four of these signatures (CN9–CN12) exhibited predominantly small segment sizes and very few that were >40 Mb (Fig. 1c) and were therefore termed focal LOH (fLOH) signatures.

Genome-wide chromosomal-scale losses (near haploidy), often followed by genome doubling, are associated with poor prognosis in B-cell acute lymphoblastic leukaemia36. Conversely, haploidization is associated with immune cell infiltration and a relatively better prognosis in undifferentiated soft tissue sarcoma16. This is an uncommon event in cancer (0.2% prevalence in TCGA; Extended Data Fig. 3g) but is seen in as much as 3% of sarcomas and mesotheliomas. We reasoned that this phenomenon could result in a distinctive copy number signature that could have clinical implications. We selectively extracted signatures from cancers that display an LOH of more than 70% of the genome, which revealed the distinctive signatures CN13, CN14 and CN15. We experimentally confirmed these rare signatures through ploidy sorting and single-cell DNA sequencing (SCS) of undifferentiated soft tissue sarcoma (Extended Data Fig. 6c and Supplementary Table 6), which are known to have genome-wide LOH and a complex subclonal structure16. These unique signatures were represented in multiple subclones and reflected successive WGDs on a background of genome-wide LOH (Fig. 3c). Other patterns of distinctive hypodiploidy37,38 were enriched in adrenocortical carcinoma and chromophobe renal cell carcinoma (CN14 and CN16; Extended Data Fig. 6d, e). Mapping of these signatures to the genome displayed recurrent LOH in chromosome regions 1p, 3p, 5q, 9, 10q, 13q and 17p (Extended Data Fig. 6f, g and Supplementary Table 6), which matched known patterns of aneuploidy in these cancers27,28.

An allele-specific deletion of a DNA segment harbouring an essential gene that results in LOH represents a potential therapeutic vulnerability39, and such regions have been shown to be under strong negative selection for deleterious mutations22,40,41. We hypothesized that in cancers with extensive LOH signatures, regions of the genome with a high density of essential genes may show retention of heterozygosity. An enrichment analysis revealed that regions of retained heterozygosity were enriched in essential genes compared with random selections of regions across the genome (Extended Data Fig. 6h, i). These essential-gene-enriched regions are probably subject to strong negative selection for genomic losses and therefore represent a particularly rich area to explore for therapeutics. This is particularly relevant to cancers that have extensive LOH, as tagged here with cLOH signatures in adrenocortical carcinomas, kidney chromophobe cancers and mesotheliomas.

Signatures associated with HRD

Somatic tandem duplications (TDs) are commonly found in breast cancer and ovarian cancer that show failure of homologous recombination (HR) repair of DNA double-strand breaks, for example, owing to defective BRCA1 or BRCA2 expression29,42. A detailed characterization of TD across cancer types has revealed three patterns with duplicated segments that range around 10 kb, 200 kb or 2 Mb (ref. 29). CN17 has a segment size distribution that overlaps with the largest of these three patterns and was strongly associated with TD (Extended Data Fig. 7a and Supplementary Table 7; odds ratio (OR) = 6.3, q = 3.6 × 10–17, two-sided Fisher’s exact test) and enriched in cancer types known to show TD29 (Extended Data Fig. 7b).

We found an enrichment of CN17 in samples that harbour germline and/or somatic mutations in the key HR genes BRCA1, BRCA2, PALB2, FBXW7 and CDK12, but not RAD51C (Fig. 3d and Supplementary Table 7), and in a more comprehensive analysis of the HR repair pathway (Extended Data Fig. 7c and Supplementary Table 7). In addition to mutations, epigenetic silencing of HR genes can result in HRD43. This was further investigated by examining the promoter methylation status of BRCA1 in breast cancers with CN17 attribution. This revealed levels of CN17 comparable to samples with bi-allelic loss of HRD genes (Extended Data Fig. 7d, e). Extending this to a multivariate pan-cancer analysis showed that CN17 was significantly associated with promoter hypermethylation of BRCA1 across cancer types (Extended Data Fig. 7f), in addition to CN9. Further supporting the link between CN17 and HRD, other lines of evidence, including scarHRD scores44 and SBS and ID mutational signatures from WES and WGS, showed a strong correlation with CN17 attribution (Fig. 3e, f, Extended Data Fig. 7g, h and Supplementary Table 7). In addition, positive associations were found between CN17 and the APOBEC mutational signatures SBS2 and SBS13, which are prevalent around DNA double-strand breaks45.

Genome topographical mapping of CN17 in CN17-enriched cancers revealed a distribution of LOH segments (Extended Data Fig. 7i) that was tumour-type-specific, a feature not seen in heterozygous segments (Extended Data Fig. 7j), which suggests that there is tissue-specific-selective forces associated with DNA deletions. Breast cancer, ovarian cancer and uterine carcinosarcoma displayed recurrent chromosomal LOH at 8p, 17 (including BRCA1 and TP53) and 22 (Extended Data Fig. 7k). Focal LOH was also observed on 9q around TSC1, 13q around BRCA2 and RB1, and 19p around STK11. By contrast, CN17-attributed sarcomas displayed strong peaks of recurrent LOH around known sarcoma tumour suppressor genes46 (CDKN2A, RB1 and TP53; Extended Data Fig. 7l). The six other tumour types enriched in CN17 displayed recurrent chromosomal LOH at 8p, 9p, 17p, 19p and 21 (Extended Data Fig. 7m). These findings suggest that copy number signatures could be helpful in revealing the potential mechanisms that underpin the positive selection of cancer genes.

We hypothesized that tumour microenvironmental conditions could provide an explanation for finding CN17 in cancers without mutations in HRD-related genes, as hypoxia can fuel HRD in many cancers47,48. Modelling of copy number signature attributions with comprehensive readouts of transcriptome-based hypoxia gene signatures across cancer types49 revealed a significant positive correlation with CN17 attribution and with signatures of aneuploidy. This result confirms that hypoxia is strongly associated with different patterns of genomic instability, including HRD, in cancer genomes (Extended Data Fig. 7n).

Signatures associated with cancer-driver genes

To identify genetic mechanisms that are potentially causative of copy number signature patterns, we associated somatic cancer-driver gene mutations with copy number signatures and found significant differences between cancer types. A consistent finding across cancer types was a positive association between TP53 mutations and multiple copy number signatures (Fig. 4a and Supplementary Table 8). TP53 mutations were also associated with an increased diversity of copy number signatures (Extended Data Fig. 8a; OR = 3.66, q = 3.0 × 10–51), which provides support for a link between TP53 alterations and aneuploidy5. This result was also confirmed through the observation of CIN signatures such as CN9 in SCS data from RPE1 cells in which TP53 mutations were induced and from tumours from patients with Li–Fraumeni syndrome (Extended Data Fig. 8b, c and Supplementary Table 8).

Fig. 4: Genomic associations of copy number signatures.
figure 4

Associations between copy number signatures (x axis) and driver-gene single nucleotide variant and ID status (y axis) across each TCGA tumour type (panels). Effect size (log2(OR), colour), and significance level (–log2(q), size) from two-sided Fisher’s exact tests are displayed. Non-significant (q ≥ 0.05) associations are not shown.

Mutations in RNF43, HLA-B, HLA-C and BRAF are commonly seen in microsatellite instable colon cancers and were negatively correlated with samples with tetraploid genomes (that is, CN2 attributed; Extended Data Fig. 8d). Microsatellite instability is associated with high immune cell infiltration, whereas aneuploidy is associated with a decrease in leukocyte fraction50. Across multiple cancer types, we observed a general trend of decreased leukocyte fractions in cancers with copy number signatures of aneuploidy compared to diploid cancers while accounting for purity (CN1; Extended Data Fig. 8e). Similar to colon cancer, multiple cancer-driver genes were associated with CN1 and CN2 in endometrial cancer, which was largely driven by differential copy number and mutation patterns seen in microsatellite stable and unstable tumours (Extended Data Fig. 8f). Last, we noted a positive association between CN17 and TP53 mutations in human papilloma virus (HPV) head and neck squamous cell cancer (HNSC) (Extended Data Fig. 8g and Supplementary Table 8). HNSCs are among the most hypoxic of all cancers and are associated with resistance to radiotherapy49,51. We therefore reasoned that the association seen here with HRD may actually be driven by hypoxia. Indeed, there was a significant increase in hypoxia scores in HPV-negative HNSC (Extended Data Fig. 8h).

To assess the relationships between copy number signatures and copy number driver genes, we evaluated the associations between attributions of copy number signatures and either homozygous deletions of tumour suppressor genes or amplifications of known proto-oncogenes (Methods). Copy number drivers such as MDM2, EGFR, CCNE1, MYC and ERBB2 were strongly positively associated with the amplicon signatures CN6–CN8 as well as CN17 (Extended Data Fig. 8i and Supplementary Table 8). By contrast, CDKN2A was the only homozygously deleted tumour suppressor gene associated with any signature, most commonly CN9.

We also explored the recent links between ancestry and HRD, genomic instability and chromothripsis52. The copy number signatures CN17 (HRD), CN6 and CN7 (chromothripsis) and some signatures with unknown aetiology were enriched in tumours of individuals with African ancestry (Extended Data Fig. 8j and Supplementary Table 8). We further associated tumour copy number signatures in people with Asian ancestry and found an enrichment of CN7 (Extended Data Fig. 8k and Supplementary Table 8), a chromothripsis pattern most frequently seen in breast cancers. In contrast to SBS and ID signatures10, no associations were found between any copy number signature and cancer risk factors such as sex, smoking status or alcohol consumption (Extended Data Fig. 8l and Supplementary Table 8). Significant associations were found between age and copy number signature attribution in endometrial cancer (Extended Data Fig. 8m and Supplementary Table 8); however, this was driven by subtype differences. That is, serous cancer versus endometrioid endometrial cancer (difference in mean age at diagnosis = 4.7 years, P = 8.99 × 10–5, two-sided Mann–Whitney test), in which non-endometrioid endometrial cancers are strongly associated with HRD53 and enriched in CN17 (OR = 13.6, P = 2.5 × 10–22, two-sided Fisher’s exact test).

Discussion

Here we presented a copy number signature framework that provides great utility for the exploration of copy number patterns across multiple cancer types and distinct experimental platforms and exceeds the capabilities provided by mutational signatures of substitutions, IDs or rearrangements. Signatures of substitutions and IDs have translational utility and have been identified across most cancer types and can be generally derived from WGS and, at much lower resolution, WES data54. Rearrangement signatures can only be derived exclusively from WGS data and cannot capture important prognostic information such as WGD. By contrast, this copy number signature framework can be applied across all cancer types, which enabled robust and consistent identification of copy number signatures from WGS, WES, reduced representation bisulfite sequencing (RRBS), SCS and SNP6 microarray data.

The identified copy number signatures hold clinical relevance with prognostic implications for patients in which amplicon signatures are observed (Extended Data Fig. 5a). Moreover, the identification of a copy number signature associated with HRD, although not the first such identification18, suggests that incorporating such signatures within existing bioinformatics tools for predicting HRD could further increase the accuracy of these tests55.

The field of copy number signatures is nascent, with multiple distinct methods previously implemented in distinct tumour types16,17,18,19,20. As the field matures, it will become increasingly clear which models are better suited to addressing specific clinical or biological questions. To resolve these questions, pan-cancer analyses that utilize all of these methods will be important, and we present here the first step towards that goal: a mechanism-agnostic pan-cancer compendium of copy number signatures derived from allele-specific profiles.

Methods

Utilized datasets

Using SNP6 microarray data, copy number profiles were generated for 9,873 cancers and matching germline DNA of 33 different types from TCGA6 using allele-specific copy number analysis of tumours (ASCAT)56 with a segmentation penalty of 70 (Supplementary Table 1). In addition, a set of whole-genome sequences from 512 cancers of the International Cancer Genome Consortium that overlapped with tumour profiles in TCGA were analysed33 to generate WGS-derived copy number profiles (see below). Last, a set of whole-exome sequences from 282 cancers from TCGA was analysed to generate exome-derived copy number profiles (see below).

Copy number profile summarization

Copy number segments were classified into three heterozygosity states: heterozygous segments with copy number of (A > 0, B > 0) (numbers reflect the counts for major allele A and minor allele B); segments with LOH with copy number of (A > 0, B = 0); and segments with homozygous deletions (A = 0, B = 0). Segments were further subclassified into five classes on the basis of the sum of major and minor alleles (TCN; Extended Data Fig. 1e) and were chosen for biological relevance as follows: TCN = 0 (homozygous deletion); TCN = 1 (deletion leading to LOH); TCN = 2 (wild type, including copy-neutral LOH); TCN = 3 or 4 (minor gain); TCN = 5–8 (moderate gain); and TCN ≥ 9 (high-level amplification). Each of the heterozygous and LOH TCN states were then subclassified into five classes on basis of the size of their segments: 0–100 kb, 100 kb–1 Mb, 1 Mb–10 Mb, 10 Mb–40 Mb and >40 Mb (the largest category for homozygous deletions was restricted to >1 Mb). This subclassification was used to capture focal, large-scale and chromosomal-scale copy number changes. In this way, copy number profiles were summarized as counts of 48 combined copy number categories defined by heterozygosity, copy number and size, which we defined as N = (n1,n2,…,n48). For a given dataset, the copy number profiles of a set with S samples were then summarized as a nonnegative matrix with S × 48 dimensions. The segment sizes were selected to ensure that a sufficient proportion of segments were classified in each category, which resulted in a reasonable representation across the pan-cancer TCGA dataset (Extended Data Fig. 1f–h). Two examples, representing a mostly diploid adrenocortical carcinoma (Extended Data Fig. 1i, j) and a copy number aberrant bladder cancer (Extended Data Fig. 1k–l), are provided to illustrate how the segments from a copy number profile are summarized by our framework into a vector of mutually exclusive and exhaustive quantitative features.

Deciphering signatures of copy number alterations

Copy number signatures were extracted by applying our previously developed approach for creating a reference set of signatures10. Specifically, SigProfilerExtractor (v.1.0.17)21 was applied to the matrix encompassing all TCGA samples, and separately to each matrix corresponding to an individual tumour type. In brief, SigProfilerExtractor utilizes nonnegative matrix factorization (NMF) to find a set of copy number signatures ranging from 1 to 25 components for each examined matrix. For each number of components, 250 NMF replicates with distinct initializations of the lower dimension matrices were performed on the Poisson resampled data. SigProfilerExtractor was used with default parameters, except for the initializations of the lower dimension matrices, for which random initialization was utilized consistent with our prior analyses of mutational signatures10,11. After performing 250 NMFs, SigProfilerExtractor clusters the factorization within each decomposition to automatically identify the optimum number of operative signatures that best explain the data without overfitting these data21.

As previously done10, the sets of all identified copy number signatures were combined into a reference set of pan-cancer copy number signatures by leveraging hierarchical clustering based on the cosine dissimilarities between each signature. The number of combined signatures is chosen to maximize the minimum average cosine similarity between each signature in a cluster and the mean of all samples in that cluster to ensure that each copy number signature in a cluster has a high similarity to the combined copy number signature for that cluster. Simultaneously, the maximum cosine similarity between mean copy number signatures for each cluster is minimized to ensure that each combined signature is distinct from all others. To avoid reference signatures being linear combinations of two or more other signatures, for each identified signature, a synthetic sample was created with the pattern of the signature multiplied by 1,000 copy number segments. Furthermore, the synthetic sample was resampled with probabilities proportional to the strength of each copy number category in each identified signature. Each resampling was then scanned for activity of all other signatures from the reference set. If a resampled sample can be reconstituted with a cosine similarity >0.95 by 3 or fewer other signatures, the signature used to create the synthetic sample was deemed to be a linear combination of those signatures, and the signature was removed from the global reference set of signatures.

Reference set of copy number signatures

Initially, 28 pan-cancer copy number signatures were derived from the different SigProfilerExtractor analyses of the 9,873 copy number profiles from SNP microarrays. In silico evaluation and manual curation showed that ten copy number signatures were linear combinations of two or more other signatures. Additionally, three signatures were deemed to be artefactual owing to oversegmentation of copy number profiles. These artefactual signatures were removed from further analyses, as were samples with any attribution of any of these artefactual signatures (116 samples; 1.2% of all TCGA samples). Moreover, samples with >25 Mb of homozygous deletions across the genome were removed from downstream analyses (58 samples), leaving 9,699 samples for full analysis. Following signature assignment (see below), three of the signatures that were removed owing to linear combination were re-extracted within tumour-type-specific assignment (cosine similarity = 1), which indicates that some copy number profiles could not be explained well without these three signatures. As a result, these 3 signatures were reintroduced into the compendium of signatures, leaving a total of 19 signatures. Last, it was observed that a number of samples with high amounts of LOH were poorly explained by the 19 signatures. To remedy this, signatures were extracted from all samples with a proportion of the genome LOH > 0.7. This extraction identified 3 new signatures that were incorporated into the reference set of signatures, giving 22 signatures. One of the newly identified LOH signatures was able to reconstitute 1 of the previous 19 signatures as a linear combination with another signature; therefore the linear combination LOH signature was removed from the reference set, leaving 21 non-artefactual pan-cancer signatures of copy number alteration.

CN1–CN3 form a group of ploidy-associated signatures. CN1 and CN2 display TCNs between 2 and 3–4 respectively, with predominantly >40 Mb heterozygous segments. CN3 consists of predominantly heterozygous segments of TCNs 5–8 with sizes >1 Mb.

CN4–CN8 form a group of amplicon-associated signatures that all have segment sizes predominantly between 100 kb and 10 Mb but with differing TCN or LOH states. CN4 consists of a mixture of LOH segments with a TCN of 1 and heterozygous segments with TCNs 3–4. CN5 consists almost entirely of LOH segments with a TCN of 2. CN6 consists of a mixture of LOH segments with a TCN of 2 and heterozygous segments with TCNs 3–4. CN7 consists of a mixture of heterozygous segments with TCNs of 3–4, 5–8 and 9+. CN8 consists of predominantly heterozygous segments with TCNs of 9+.

CN9–CN12 form a group of signatures with considerable LOH components. CN9 consists of a mixture of LOH segments with a TCN of 2 and heterozygous segments with a TCN of 2, each ranging from 100 kb to 40 Mb, which is suggestive of structural CIN. CN10 consists of a mixture of LOH segments with TCNs 2 and 3–4 and heterozygous segments with TCNs 3–4 between 100 kb and 40 Mb. CN11 consists of a mixture of LOH segments with TCNs 3–4 and heterozygous segments with TCNs 5–8, each at predominantly 1–10 Mb. CN12 consists of mostly LOH segments of a TCN of 2 with sizes >100 kb and additional heterozygous segments of TCNs 3–4 with sizes between 10 and 40 Mb.

CN13–CN16 form a group of signatures with whole-arm-scale or whole-chromosome-scale LOH events, a form of numerical CIN. CN13 is predominantly LOH TCN 1 segments, CN14 is LOH TCN 2 and CN15 is LOH TCN 3–4. CN16 consists of LOH segments with TCNs of 3–4 and heterozygous segments with TCNs of 5–8, each at >40 Mb.

CN17 has been associated with the tandem duplicator phenotype (Fig. 4). This signature consists of LOH segments of TCNs 2 and 3–4 and heterozygous segments of TCNs 3–4 and 5–8, each with segment sizes of 1–40 Mb.

CN18–CN21 originate from unknown processes and are diverse in their copy number patterns. CN18 consists of predominantly heterozygous segments of TCNs 4–8 at >1 Mb, but with appreciable contributions of LOH segments with TCNs 3–4 at >1 Mb and heterozygous segments with TCNs 9+ at >100 kb. CN19 consists of segments between 100 kb and 40 Mb that are heterozygous with TCNs 3–4 or less commonly LOH with a TCN of 1 or 2. CN20 consists of predominantly heterozygous segments with TCNs 3–4 at 100 kb–40 Mb with some heterozygous segments of TCNs 3–4 at 100 kb–10 Mb. CN21 consists of heterozygous segments with a TCN of 2 at >1 Mb and many heterozygous segments with TCNs 3–4 at 100 kb–1 Mb.

Assignment of copy number signatures to individual cancer samples

The global reference set of copy number signatures was used to assign an activity for each signature to each of the 9,873 examined samples using the decomposition module of SigProfilerExtractor21. For the assignment, the information of the de novo signature and their activities assigned to each sample were used to implement the decomposition module with default parameters, except for the NNLS addition penalty (nnls_add_penalty), which was set to 0.1, the NNLS removal penalty (nnls_remove_penalty), which was set to 0.01, and the initial removal penalty (initial_remove_penalty), which was set to 0.05. Signatures were assigned to samples in both tumour-specific evaluations and in a pan-cancer evaluation. As previously done10, the signature attributions from either tumour-specific or pan-cancer evaluations that gave the best cosine similarity between the input sample vector and the reconstructed sample vector were used as the attributions for that sample in all subsequent analyses.

Copy number signatures derived from WGS and WES data

A set of samples from TCGA with both SNP array and exome sequencing data were selected (n = 282). Copy number profiles were generated from the exome sequencing data using ASCAT across all of the dbSNP common SNP positions with a segmentation penalty ranging from 20 to 140. Signatures were re-extracted for these 282 samples from both the SNP-array-derived copy number profiles and the exome-derived copy number profiles, and the resulting signatures were compared.

For WGS data, we examined 512 whole-genome sequenced samples from the PCAWG project overlapping with TCGA samples with microarray data. Copy number profiles from WGS data were generated using ASCAT across the SNP6 positions, with a segmentation penalty ranging from 20 to 120. Signatures were extracted for samples with both SNP6-microarray-derived copy number profiles and the WGS-derived copy number profiles, and the extracted signatures were compared. In all cases, a segmentation penalty of 70 gave the best concordance for both copy number profiles and extracted copy number signatures based on SNP6 microarray, WGS and WES data.

Copy number signatures derived from different copy number callers

A set of 3,175 allele-specific copy number profiles called using the ABSOLUTE57 algorithm were obtained. Copy number signatures were extracted from the 3,175 ABSOLUTE profiles, as well as re-extracted for the 3,175 corresponding ASCAT profiles. Signatures were compared using cosine similarity with between 2 and 12 signatures extracted, and with the sigProfiler suggested solution of 4 signatures extracted.

Mapping copy number signatures to the landscapes of cancer genomes

See Supplementary Methods for details of mapping copy number signatures back onto the reference genome.

For all mapping analyses, P values were adjusted for multiple testing as appropriate for Monte Carlo testing58.

Associations between copy number signatures and events defined by genomic region

Localized events (chromothripsis33 and amplicon structure30) identified using WGS data were associated with mapped copy number signatures from TCGA for all available matching samples (chromothripsis n = 657; amplicon n = 1,703). Each segment in every sample was categorized as overlapping or non-overlapping of a localized event. For each copy number signature, the association was then tested using two-sided Fisher’s exact test on a contingency table of segments categorized as overlapping or non-overlapping of a localized event and assigned to or not assigned to the given copy number signature across all samples. Multiple-testing correction was performed using the Benjamini–Hochberg method.

Genome-doubled copy number signatures

With the copy number categories being defined as 0, 1, 2, 3–4, 5–8 and 9+, it is possible to artificially ‘genome double’ any copy number category, other than 0, by assigning it to the next highest copy number category. In this way, we artificially ‘genome doubled’ each signature by assigning the count for each copy number class to its next highest copy number class. First, the copy number 1 class is assigned a count of 0, then each copy number class is assigned the count of the preceding copy number class. For example, copy number class of 2 is assigned to the previous copy number class of 1, 3–4 assigned previous 2, and so on, until finally the copy number 9+ class is assigned a count that is the sum of the previous copy number 5–8 class and 9+ class. During this conversion, LOH and size categories were retained so that the only shift is in copy number. Having performed this conversion, cosine similarities between the artificially genome-doubled signatures and the original signatures were calculated. Any genome-doubled and original signature pair that had a cosine similarity of >0.85 was considered to contain a pair of signatures with analogous copy number patterns distinguished only by their genome-doubling status.

Associations between copy number signatures and ploidy

Ploidy for each copy number profile was calculated as the relative length weighted sum of TCN across a sample. The proportions of the genome that displayed LOH (pLOH) were also calculated. Samples with a ploidy above −3/2 × pLOH + 3, meaning an LOH-adjusted ploidy of 3 or greater, were deemed to be genome-doubled samples. By contrast, samples with a ploidy above −5/2 × pLOH + 5, meaning an LOH-adjusted ploidy of 5 or greater, were deemed to be twice genome-doubled samples. All other samples were considered as non-genome-doubled samples. Each signature (CN1–CN21) was associated with each genome doubling category (GD×0, GD×1 and GD×2) using one-sided Fisher’s exact test on a contingency table with samples categorized by whether the samples have >0.05 attribution to the given copy number signature or not, and whether the sample has the given genome doubled category or not. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Associations between copy number signatures and known cancer risk factors

Associations between attributions of copy number signatures and attributions of SBSs, IDs and doublet-base signature exposures10 were performed using Kendall’s rank correlation. Only the significant associations found in both cancer-type-specific and pan-cancer analysis are reported. For the cancer risk association analyses, copy number signatures were associated with sex59, tobacco smoking60 and alcohol drinking status61. For each copy number signature, the association was conducted using two-sided Fisher’s exact test on a contingency table of a clinical feature categorized as present or absent and assigned to or not assigned to the given copy number signature across all samples. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Associations between copy number signature attribution (binarized to present or absent) and the TDP (also binarized to present or absent)29 were performed using two-sided Fisher’s exact test (n = 882). This was performed for each copy number signature separately. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method, and only associations with q < 0.05 are reported.

Associations between copy number signature attribution (binarized to present or absent) and driver-gene single nucleotide variant (SNV) and ID mutation status40 were performed within tumour types using two-sided Fisher’s exact test (n = 6,543 across all cancer types). This was performed for all copy number signature/gene combinations for which the gene was mutated in the given cancer type and the copy number signature was observed in the given cancer type. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method, and only associations with both q < 0.05 and |log2(OR)|>1 are reported.

Driver copy number alterations of COSMIC cancer gene census genes62 were defined as follows: (1) homozygous deletion (CN = (0, 0)) of genes listed as deleted (D) in COSMIC mutation types; or (2) amplification (CN > 2 × ploidy + 1) of genes listed as amplified (A) in COSMIC mutation types. Associations were then performed on copy number driver alterations for SNV and ID driver gene alterations as outlined above (n = 9,699 across all cancer types).

The diversity of copy number signatures, as defined by Shannon’s diversity index, was associated with both SNV and ID and copy number driver gene mutations using a logistic regression model with binary diversity (>0, =0) as the dependent variable, and tumour type and gene mutation status as independent variables. LGG was taken as the reference tumour type. Only driver genes with >250 mutant samples in the dataset were included in the model.

Associations between copy number signature attribution (binarized to present or absent) and age at diagnosis (binarized to above or below median separately for each cancer type) were performed within cancer types using two-sided Fisher’s exact test (n = 8,841 across all cancer types). All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method, and only associations with both q < 0.05 and |log2(OR)|>1 are reported.

Leukocyte counts were obtained from TCGA50. The leukocyte fraction was associated with copy number signatures using a logistic regression model with binarized leukocyte fraction (fraction > or ≤ median fraction) as the dependent variable, and binarized copy number signature attribution (0, >0 attribution) and ASCAT estimated tumour purity as independent variables. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Copy number signatures and defective HR

Signatures were tested for enrichment in tumour types using one-sided Mann–Whitney tests of signature attribution in a given tumour type versus all other tumour types. This was performed for all signature and tumour combinations. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

The following core HR repair pathway member genes were chosen for interrogation: BRCA1, BRCA2, RAD51C and PALB2 (refs. 63,64). Copy number alterations across these genes were identified based on ASCAT copy number profiles for homozygous deletions (that is, CN = (0, 0)) and LOH (that is, CN = (>0, 0)). Somatic SNVs and IDs were taken from ref. 40. Pathogenic germline variants in BRCA1 and BRCA2 were taken from ref. 65. Samples were deemed as bi-allelically mutated for the HR pathway if homozygously deleted or if more than one of any of the other classes of alteration were present within any of the HR pathway genes. Mono-allelic loss was defined as one of any of the non-homozygously deleted alterations within any of the HR pathway genes. Wild type was defined as no alterations in any HR pathway genes. The associations between HR pathway status and CN17 were then restricted to only breast (n = 589), ovarian (n = 309) and pan-cancer (n = 4,919). Two-sided Fisher’s exact tests were performed between wild-type and mono-allelic samples, between wild-type and bi-allelic samples, and between mono-allelic and bi-allelic HR pathway status samples. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

A further multivariate logistic regression model was utilized with CN17 attribution (>0 or 0) as the dependent variable, and BRCA1, BRCA2, RAD51C, PALB2, FBXW7, CDK12 mutational status, categorized as wild type, mono-allelic or bi-allelic as previously described, as independent variables, to test associations between the mutation status of individual HR pathway genes and CN17.

Orthologous scores of HRD were calculated using scarHRD61. Associations between scarHRD scores and CN17 were tested using two-sided Fisher’s exact tests, with CN17 categorized as present or absent, and scarHRD scores categorized as positive or negative around thresholds of both 42 (which has been described as an adequate threshold in breast cancer61) and 63 (which has been described as an adequate threshold in ovarian cancer66). Furthermore, we associated the presence or absence of CN17 with continuous scarHRD scores using two-sided Mann–Whitney test.

To test associations between promoter hypermethylation of the HR machinery and CN17, TCGA methylation β values were downloaded from https://portal.gdc.cancer.gov/ and TCGA-normalized gene expression RSEM values were downloaded from https://gdac.broadinstitute.org/

Relationships between log10(RSEM) values and mean TSS200 and TSS1500 associated methylation probe β values were initially inspected in breast cancer to determine a threshold mean β value for determining promoter hypermethylation and subsequent epigenetic silencing of BRCA1. This threshold was set at mean β > 0.7.

CN17 attribution was associated between BRCA1 promoter hypermethylated breast cancer samples and both genomic BRCA1 wild-type and bi-allelically mutated breast cancer samples using two-sided Mann–Whitney test. This analysis was extended to a pan-cancer association, performing two-sided Fisher’s exact tests between signature attribution or not, and promoter hypermethylation (mean TSS200 and TSS1500 β > 0.7) or hypomethylation (mean TSS200 and TSS1500 β ≤ 0.7). P values were corrected for multiple testing using the Benjamini–Hochberg method.

Copy number signatures associated with hypoxia

Gene-expression-derived scores of hypoxia from 8,006 TCGA tumours were used49,67. A linear regression with hypoxia score as the dependent variable, and binarized copy number signature attributions (>0, =0) as well as tumour type as independent variables.

Copy number signatures associated with complex rearrangements

Assignment of rearrangement phenomena to PCAWG samples were used31. Associations of each re-arrangement phenomenon with each copy number signature were evaluated using two-sided Fisher’s exact tests of copy number signature non-attributed or attributed (=0, >0) against rearrangement phenomenon presence or absence. P values were corrected for multiple testing using the Benjamini–Hochberg method.

Copy number signatures associated with HPV in HNSC

We used HPV testing status from TCGA HNSCs obtained from ref. 68. HPV status was associated with copy number signature attribution using two-sided Fisher’s test. P values were corrected for multiple testing using the Benjamini–Hochberg method. Furthermore, hypoxia scores (see above) were associated with HPV status using two-sided Mann–Whitney test.

Copy number signature associated with ethnicity

Ethnicity information for 11,160 individuals from TCGA was taken from the TCGA Clinical Data Resource59. Copy number signatures (binarized to present/absent) were associated between Black/White ethnicity and between Asian/White ethnicity separately using two-sided Fisher’s exact tests. P values were corrected for multiple testing using the Benjamini–Hochberg method.

Copy number signatures associated with changes of overall survival

Survival data for 11,160 individuals from TCGA were obtained from the TCGA Clinical Data Resource59. Univariate disease-specific survival analysis for signatures was performed using a log-rank test and Kaplan–Meier curves in R, with groups being unattributed (attribution = 0) and attributed (attribution > 0) for each signature separately, or for summed attributions of a set of signatures (for example, amplicon signatures).

Multivariate disease-specific survival analysis was performed using the Cox’s proportional hazards model in R with Boolean attributed/non-attributed variables for each copy number signature and tumour type as covariates. To account for potential violations of Cox’s model’s proportional hazards assumption, we also conducted the same analysis using the accelerated failure time model with the Weibull distribution using the flexsurvreg function in R. All P values were corrected for multiple hypothesis testing using the Benjamini–Hochberg method.

Simulating copy number profiles

See Supplementary Methods for details of the methods used to simulate copy number profiles from various processes.

Single-cell isolation, FACS analysis and DNA library generation for USARC ploidy estimation

Fresh frozen tumour tissue was thawed on ice, dissected and homogenized with 500 µl of lysis buffer (NUC201-1KT, Sigma). Following the release of single nuclei, samples were centrifuged, and the resulting precipitate removed. A 10 µl sample was taken to count and evaluate the extracted nuclei. The lysate was cleaned using a sucrose gradient following the manufacturer’s instructions (NUC201-1KT, Sigma). After cleaning, the nuclei were centrifuged at 800g for 5–10 min at 4 °C and resuspended in PBS, supplemented with 140 µg ml–1 RNase (19101 Qiagen) and stained with 1 µg ml–1 DAPI (Sigma-Aldrich), and 2.5 µg ml–1 Ki-67 antibody (BioLegend) per 1 million cells in 100 µl. Stained nuclei were analysed using a FACS Aria Fusion cell sorter (BD bioscience) and FACS DIVA software (v.8.0.1). Cells were sorted using a 130-μm nozzle with 12 psi set for sheath pressure. Each gated population of interest was collected into a separate 1.5-ml tube, and a custom sort precision of 0-16-0 (Yield-Purity-Phase) was used. For cells collected into plates, the sort precision used was Purity, defined as 32-32-0 (Yield-Purity-Phase). DAPI was measured using a 355-nm UV laser with a 450/50 bandpass filter. Ki-67 was measured using a 635-nm red laser with a 670/30 bandpass filter. Forward scatter and side scatter were both measured from a 488-nm blue laser on a linear scale. DAPI was also measured on a linear scale and was used to estimate the DNA content per single cell. A control diploid cell line was used to establish accurate ploidy measurements before sorting. Forward versus side scatter area was used to exclude debris, whereas the height versus area of the DAPI fluorescence was used to exclude doublets. FACS analysis revealed the presence of three major aberrant cell populations (Supplementary Methods), including a haploid population (1n), a nearly diploid population (2n, Ki-67 positive) and a WGD population (3n+). A non-proliferating, non-aberrant, normal cell population was also identified (2n, Ki-67 negative).

Once sorted, single nuclei suspensions were processed using a Chromium Single Cell DNA Library & Gel Bead kit (10X Genomics, PN-1000040) according to the manufacturer’s instructions, with a target capture of 1,000–2,000 cells. The resulting barcoded single-cell DNA libraries were sequenced with an Illumina HiSeq 4000 system using 150 bp paired-end sequencing with a coverage ranging from 0.01 to 0.08 X per cell. Germline bulk WGS was also performed on a XTen instrument (Illumina) as previously described16. Copy number signatures were also evaluated in single cells harbouring chromothripsis, as well as WGD events using sequencing data that had already been generated from a cell-based model system linking chromothripsis and hyperploidy69.

Single-cell allele-specific copy number alteration calling using ASCAT.sc

USARC single-cell paired-end reads generated using the chromium single cell CNV platform were processed using the 10X Genomics Cell Ranger DNA Pipelines (https://support.10xgenomics.com/single-cell-dna/software/pipelines/latest/what-is-cell-ranger-dna). Following sample demultiplexing, data were aligned to the GRCh38 reference genome and a barcoded BAM file was obtained for every considered single cell per individual USARC ploidy population. To analyse each barcoded BAM file and derive total copy number alterations for each single cell, we then applied ASCAT.sc v.1.0 (https://github.com/VanLoo-lab/ascat), our in-house pipeline, to analyse single-cell and shallow coverage WGS data. Similar to its predecessor ASCAT, which measures allele-specific copy number alterations in bulk tumour data56, ASCAT.sc infers single-cell TCN states from changes in the relative read depth (logR). Importantly, ASCAT.sc derives the logR from the number of reads aligning in different genomic bins, unlike ASCAT, which relies on both the logR and the allelic imbalance (otherwise known as the B-allele frequency) at SNP loci identified as heterozygous in the germline. Thus, ASCAT.sc utilizes logR shifts to segment the genome into regions with constant TCN states, thereby assigning integer copy number profiles to single cells. For single-cell allele-specific copy number alterations, we first performed single-cell segmentation using multiple piecewise constant fitting70 using the R package copynumber v.1.26.0 (https://bioconductor.org/packages/release/bioc/html/copynumber.html). We then provide ASCAT.sc with the available matched-normal germline sample and generate phased germline SNPs using Beagle (v.5.1)71 as part of the subclonal copy number calling pipeline, Battenberg72. ASCAT.sc then uses single cell logR values alongside phased SNP data, as well as allele counts for heterozygous SNPs (generated using alleleCount; https://github.com/cancerit/alleleCount) to calculate allele-specific copy number alterations in single cells. These results can be used to group cells into distinct tumour subclones while also excluding noisy single cells.

Copy number signatures on single-cell copy number profiles

For all single-cell datasets, adjacent genomic bins within a chromosome with the same major and minor copy number were combined into a single segment. Genomic bins for which no copy number state was assigned were removed from the profiles. Copy number summaries were then generated, and TCGA copy number signatures were scanned using sigProfilerSingleSample on all cells.

Because of the nature of the undifferentiated sarcoma for which single-cell sequencing was performed (near-genome-wide LOH), the majority of the genome should be LOH for tumour cells, and a minority of the genome should be LOH for normal cells. However, we observed a number of cells for which the majority of the genome had a copy number of (1, 4). This is an erroneous copy number pattern, which occurred owing to the difficulty of calling LOH from single-cell data in the context of multiple genome-doubling events. Cells with a proportion of the genome LOH < 0.4 and a proportion of the genome with imbalanced copy number (major CN!=minorCN) > 0.6 were excluded from further analysis to remove erroneous profiles.

For an assessment of copy number signatures in genomically unstable single cells, BAM files from TP53 mutant RPE1 cells were downloaded69. Copy number profiles were generated as for the USARC single cell data, and scanned for signatures using sigProfilerSingleSample.

FACS and copy number profiling of ploidy populations for RRBS

The sorting strategy for RRBS workflows was modified to collect groups of cells belonging to different ploidy populations based on DAPI staining (Supplementary Methods). Five tumour samples were processed in this manner, DNA was extracted using a Quick-DNA Miniprep Plus kit (Zymo, D4068) and library preparation and quality control was performed using an Ovation RRBS Methyl-Seq system (Nugen, 0353, 0553) according to the manufacturer’s instructions. Paired-end sequencing was performed on an Illumina NovaSeq instrument using an S1 flowcell 100 cycles (single end). Allele-specific copy number calling was performed using CAMDAC (https://github.com/VanLoo-lab/CAMDAC).

Copy number signatures for the 4 ploidy-sorted populations and the bulk population were extracted using sigProfilerExtractor, setting the number of signatures to extract at 4. Artificial genome-doubling of the identified signatures was performed as described above. The 5 samples were also scanned for the 21 TCGA signatures using sigProfilerSingleSample; identified copy number signatures were categorized by their predominant genome-doubling association (see above), and the prevalence of individual genome doubling category (WGD×0, WGD×1, WGD×2) signatures was evaluated.

Copy number signatures in germline TP53 mutant cancers

We used Battenberg-derived72 copy number profiles of WGS data from cancer samples of patients with Li–Fraumeni disease73,74. Additional clinical metadata and highly curated sequencing data for additional cases were obtained from D.M., A.S. and N.L.

Data analysis

All signatures decompositions, assignments and matrix generations were performed using the sigProfiler suite (see above) of Python packages using Python v.3.7.1.

All statistical analyses were performed in R v.4.0.2. Plotting was performed with base R or with packages ggplot2, ggrepel, RColorBrewer, circlize, ComplexHeatmap, colorspace, seriation, dendextend, beanplot and corrplot. Survival analysis was performed with the R packages survival and survminer. Multiple testing correction was performed using qvalue. Cosine similarities were calculated using the cosine function from lsa. TSNE analysis was performed using Rtsne. Data handling was performed with GenomicRanges, tidyr, stringr, parallel and gtools.

Ethics

Informed consent from patients and ethical approval for tissue biobanking was obtained through the UCL/UCLH Biobank for Studying Health and Disease (REC reference: 20/YH/0088; NHS Health Research Authority). Approval for the study and ethics oversight was granted by the NHS Health Research Authority (REC reference: 16/NW/0769).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.