Single-cell genomic variation induced by mutational processes in cancer

How cell-to-cell copy number alterations that underpin genomic instability1 in human cancers drive genomic and phenotypic variation, and consequently the evolution of cancer2, remains understudied. Here, by applying scaled single-cell whole-genome sequencing3 to wild-type, TP53-deficient and TP53-deficient;BRCA1-deficient or TP53-deficient;BRCA2-deficient mammary epithelial cells (13,818 genomes), and to primary triple-negative breast cancer (TNBC) and high-grade serous ovarian cancer (HGSC) cells (22,057 genomes), we identify three distinct ‘foreground’ mutational patterns that are defined by cell-to-cell structural variation. Cell- and clone-specific high-level amplifications, parallel haplotype-specific copy number alterations and copy number segment length variation (serrate structural variations) had measurable phenotypic and evolutionary consequences. In TNBC and HGSC, clone-specific high-level amplifications in known oncogenes were highly prevalent in tumours bearing fold-back inversions, relative to tumours with homologous recombination deficiency, and were associated with increased clone-to-clone phenotypic variation. Parallel haplotype-specific alterations were also commonly observed, leading to phylogenetic evolutionary diversity and clone-specific mono-allelic expression. Serrate variants were increased in tumours with fold-back inversions and were highly correlated with increased genomic diversity of cellular populations. Together, our findings show that cell-to-cell structural variation contributes to the origins of phenotypic and evolutionary diversity in TNBC and HGSC, and provide insight into the genomic and mutational states of individual cancer cells.

How cell-to-cell copy number alterations that underpin genomic instability 1 in human cancers drive genomic and phenotypic variation, and consequently the evolution of cancer 2 , remains understudied. Here, by applying scaled single-cell whole-genome sequencing 3 to wild-type, TP53-deficient and TP53-deficient;BRCA1-deficient or TP53-deficient;BRCA2-deficient mammary epithelial cells (13,818 genomes), and to primary triple-negative breast cancer (TNBC) and high-grade serous ovarian cancer (HGSC) cells (22,057 genomes), we identify three distinct 'foreground' mutational patterns that are defined by cell-to-cell structural variation. Cell-and clone-specific high-level amplifications, parallel haplotype-specific copy number alterations and copy number segment length variation (serrate structural variations) had measurable phenotypic and evolutionary consequences. In TNBC and HGSC, clone-specific high-level amplifications in known oncogenes were highly prevalent in tumours bearing fold-back inversions, relative to tumours with homologous recombination deficiency, and were associated with increased clone-to-clone phenotypic variation. Parallel haplotype-specific alterations were also commonly observed, leading to phylogenetic evolutionary diversity and clone-specific mono-allelic expression. Serrate variants were increased in tumours with fold-back inversions and were highly correlated with increased genomic diversity of cellular populations. Together, our findings show that cell-to-cell structural variation contributes to the origins of phenotypic and evolutionary diversity in TNBC and HGSC, and provide insight into the genomic and mutational states of individual cancer cells.
The identification and characterization of endogenous mutational processes 4-6 have transformed our understanding of cancer genomes 6-11 , and have led to improved prognostic and therapeutic stratification of cancers with genomic instability 12-14 . However, mutational processes are typically inferred from bulk whole-genome sequencing (WGS), which yields aggregate signals from pools of DNA composed of millions of cells. Thus, contemporaneous post-mitotic cell-to-cell variation due to genomic instability is not detectable in bulk sequencing, and has been understudied. Single-cell WGS can readily decompose clone-specific and cellular genomic events 2,3,15 , enabling the calculation of copy number alteration (CNA) and structural variation (SV) accrual rates and mutational patterns over thousands of individual cells. This allows for the separation of evolutionary vestigial events, which are present in initial clonal expansions, from contemporaneous 'foreground' events, which reflect ongoing mechanisms of cell-to-cell genomic diversification. For example, breakage-fusion-bridge cycles (BFBCs) and homologous recombination deficiency (HRD) are endogenous mutational processes that accrue SVs with specific patterns including tandem duplications, interstitial deletions and fold-back inversions (FBIs) that generate high-level copy number amplifications 5,10,12,14 . Because HRD and BFBCs are predicted to induce cell-specific structural changes on individual maternal or paternal alleles, a haplotype-specific analysis is essential for a comprehensive account of genome-scale structural variation. Here we combine single-cell approaches with haplotype-specific analysis to reveal how different mutational processes diversify the genomes of individual cancer cells and thereby determine phenotypic variation and evolutionary selection in human tumours. We apply scaled single-cell WGS and haplotype-specific analysis to an in vitro cell line system with experimentally induced HRD-associated genomic instability and human breast and ovarian tumours defined by SV-associated mutational processes 6,10,12,16,17 . Our study reveals three sources of cell-to-cell variation in cancer genomes, with implications for interpreting phenotypic diversity and evolutionary selection in cancers with genomic instability.

Induced single-cell genomic instability
We first developed a combined experimental and computational approach for studying genome-scale cell-to-cell variation in human cells, by establishing an in vitro isogenic system of breast epithelium with induced HRD and defined temporal passaging. We generated TP53 (ref. 18 Table 1). We then subjected these cells to tagmentation whole-genome single-cell sequencing (DLP+), which enables scaled analysis of each population and inference of cell-specific rates of structural alterations 3 . In addition, we developed a computational method called SIGNALS, a hidden Markov model (HMM) which phases copy number events to individual homologues 20 in single-cell genomes to quantify haplotype-specific CNA as a source of cell-to-cell variation. SIGNALS was benchmarked on the ovarian cancer cell line OV2295, and when evaluated across different technologies and tumour types showed increased genomic and cellular resolution (0.5 Mb) compared with previously published methods 21,22 , identified cell-to-cell diversity that would be unclear when relying on total copy number, and exhibited the expected distributions of phased somatic point mutation variant allele fractions (VAFs) resulting from haplotype-specific gains and losses (Extended Data Fig. 3 and Supplementary Note).
We next tested whether haplotype-specific analysis at single-cell resolution could identify properties of mutational processes (Fig. 2). BFBC processes induce segmental amplifications adjacent to terminal losses on the same homologue, staircase-like copy number patterns and clustered FBI breakpoints [26][27][28] (Fig. 2b-d and Extended Data Fig. 5a). Using haplotype-specific alterations, we identified subclonal and variable amplitude high-level amplifications (HLAMPs, defined by 10 or more copies). HLAMPs were rare in the wild-type setting but increased with TP53 loss of function, and further increased with BRCA1 or BRCA2 loss of function (Fig. 2a). Notably, some HLAMPs were consistent with BFBCs, and affected known oncogenes including MYC (SA1188, Fig. 2b; SA906a, Fig. 2c) and PIK3CA (SA1054, Fig. 2c). An early passage of SA1188 BRCA2 +/− (1,395 cells; Extended Data Fig. 5a,b) exhibited hallmark patterns of BFBCs on chr. 3q through the presence of extant cells mapping to expected stepwise stages of progression with successive cell divisions. This included clusters of cells with reciprocal gains and losses, clusters in which the loss was extended and clusters in which a segmental amplification was adjacent to a terminal loss, including examples of cells with PIK3CA amplification (Extended Data Fig. 5c-f). Thus, an in vitro system characterized by population-scale single-cell sequencing revealed specific cell divisions that generated cell-to-cell variation in the amplitude and genomic structure of HLAMPs.
We then quantified the extent of parallel HSCN alterations, whereby cells with an identical total copy number at a given locus were composed of subsets of cells segregated by altered maternal or paternal alleles 29 (Fig. 2e,f). The rates of parallel losses and gains were increased in TP53 −/− cells relative to wild-type cells, and were further increased in BRCA1 −/− and BRCA2 −/− populations (Fig. 2g). Notably, the parallel events affected transcriptional phenotypes that resulted from the loss of either allele A or allele B; chr. 2q in SA906b provides an example (Extended Data Fig. 5g,h). Chr 2q losses in matched single-cell RNA sequencing (scRNA-seq) data were readily identified by SIGNALS (Supplementary Note) and cells with the loss of allele A or B clustered together in gene expression space (Extended Data Fig. 5i). The nearest neighbours of monosomic 2q cells in scRNA-seq were equally enriched for losses of both A and B alleles (Extended Data Fig. 5j), suggesting that maternal and paternal allelic losses converge on a common transcriptional phenotype.
In addition to multi-allelic variation, we observed extensive cellto-cell variation in the genomic locations of breakpoints of CNA events. The precise boundaries of CNAs from cell to cell yielded a pattern that we term 'serrate structural variation' (SSV) (Fig. 2h), which consists of a modal breakpoint across cells, with 'tails' that reflect either a progressive accumulation or 'erosion' away from the modal breakpoint. The aggregate, consensus copy number profiles over cells across the entire SSV regions (analogous to what would be seen in bulk sequencing libraries), revealed sloping copy number changes between integer values, indicative of an averaged signal with underlying variance (Fig. 2h). In some cases, these events were restricted to a single allele (for example, SA906a chr. 19), whereas in others, both alleles were implicated (for example, SA906b chr. 2). Further DLP+ sequencing from serial passaging 18 of these cells (an additional 7,793 cells), indicated that SSV events distributed across serial passaging, consistent with an ongoing mutational process (Fig. 2h).
In summary, the induction of genomic instability in breast epithelium yielded progressively higher rates of genomic divergence between individual cells, measurable as rate distributions with scaled single-cell WGS and cell-specific CNAs. The resulting 'foreground' cell-to-cell variation could be further characterized as clone-and cell-specific Article HLAMP, parallel allele alteration and serriform patterns of copy number breakpoints in the cellular population.

Cell-level CNA variation in HGSC and TNBC
On the basis of observing foreground mutational patterns defined by cell-to-cell variation, we next asked how the foreground event types distributed as a function of HRD and non-HRD mutational processes in TNBC and HGSC cancers. To identify appropriate patient tumour samples for this comparison, we first constructed a 'meta-cohort '

Article
Tables 2 and 3). Single-nucleotide variant (SNV) and SV mutational signature profiles that were inferred from DLP+-derived pseudobulk from the PDXs clustered with their bulk WGS counterparts (Extended Data Fig. 8a), indicating consistent mutational signature types without significant distortion of the signals from the original source tumour.
In addition, SIGNALS analysis from DLP+ showed that the proportion of the genome identified as homozygous was highly correlated with bulk sequencing (R = 0.9, P < 0.001; Extended Data Fig. 8b), and that VAFs of somatic mutations were distributed as expected (Extended Data Fig. 8c), indicating accurate single-cell HSCN inference. Cellular copy number profiles revealed extensive subclonal heterogeneity and cell-to-cell variation in both HRD and FBI tumours (Fig. 3a,b and Extended Data Fig. 8d). However, FBI cells exhibited higher overall rates of polyploidy ( Fig. 3c; P = 0.02) and chromosomal missegregation relative to HRD-Dup ( Fig. 3d; P = 0.0015). In addition, FBI tumours accrued gains at a significantly higher rate than did HRD-Dup tumours, with more skewing of the gain/loss ratio (4.9 versus 2.1, P = 0.04; Fig. 3e and Extended Data Fig. 8e,f). This was more pronounced when considering the baseline ploidy of the tumours (P = 0.0012; Extended Data Fig. 8g). Indeed, higher rates of polyploidy and segmental copy number gains may provide a greater opportunity for-and greater tolerance of-the large interstitial deletions that are found in some FBI cancers (Extended Data Fig. 7b). Pairwise HSCN distances between cells, reflecting cell-population diversity, yielded highly variable distributions across samples, ranging from a median value of 2 for the diploid TD case SA1047 to more than 123 for the pentaploid FBI case SA604   e, Ratio of chromosomal gains versus losses across different ploidy states and mutational signature groupings (number of cells shown below each violin plot). All box plots indicate the median, first and third quartiles (hinges), and the most extreme data points no farther than 1.5× the IQR from the hinge (whiskers).
(Extended Data Fig. 8h). FBI tumours were more diverse than HRD-Dup or TD samples, with average HSCN distances of 71 (FBI), 46 (HRD-Dup) and 26 (TD) (P = 0.047 FBI versus HRD-Dup, P = 0.031 FBI versus TD; Extended Data Fig. 8i). Thus, considering whole-genome duplication, overall rates of segmental aneuploidy and the gain/loss ratio, FBI and HRD-Dup tumours showed markedly different patterns of CNA accrual at the single-cell level.

HLAMP amplitude varies within FBI tumours
Next, we determined whether the CNA patterns that gave rise to single-cell variation in the cell lines could also explain cell-to-cell variation in the amplitude of HLAMPs in the tumours. We found extensive heterogeneity in the amplitude of HLAMPs across clonal populations within tumours, that would otherwise be obscured in bulk sequencing. By first focusing on a specific example, we assessed the phenotypic effect of a clone-specific HLAMP in the KRAS locus, present with average copy number 16.1 in a clone with 55 cells relative to a sibling clone composed of 230 cells that lacked the amplification (Fig. 4a). KRAS was differentially expressed between cell clusters from matched scRNA-seq data (maximum log-transformed fold change (logFC) = 0.346, q < 0.05; Fig. 4b), and immunohistochemistry for KRAS both in tissue from the primary patient and in PDX tissue corroborated a punctate pattern of expression across spatially separated regions within tumour sections (Fig. 4c). Thus, in a specific example, clone-specific HLAMP of an oncogene in a minor clone-otherwise not detectable with bulk methods-revealed co-associated clone-specific phenotypic variation. Across the dataset as a whole, FBI tumours had a 1.9-fold higher median HLAMP copy variance than did the other tumours (P = 0.00096; Fig. 4d,e), consistent with continual plasticity of HLAMP amplitude as a general property of FBI. Most events were less than 10 Mb in width (56%; Fig. 4f) and exhibited a distribution of maximum observed copy number with median 16.1 and IQR 8.7 (Fig. 4g). Furthermore, we noticed that amplitude variation in HLAMPs affected numerous other known oncogenes, including ERBB2 (DG1197), KIT (DG1197), KRAS (SA1049 and SA604), MYC (SA1184 and SA1051), CCNE1 (DG1134, SA1162 and SA604) and FGFR1 (SA1049 and SA535) (Fig. 4h). Notably, oncogenes with a variable copy number between cells and clones also exhibited greater variability in gene expression than did other genes, as measured by matched scRNA-seq (P = 0.012; Fig. 4i).
To determine the structural processes that lead to these events, we found that the rearrangement properties of variable HLAMPs were enriched for FBIs, consistent with BFBCs being a central mechanism of variable HLAMPs in FBI tumours. We also found clusters enriched for simple tandem duplications driving variable HLAMPs in HRD-Dup tumours (Methods and Extended Data Fig. 9a), providing further evidence for the different aetiological origin of these events in FBI and HRD-Dup tumours. This analysis also revealed that in many cases, clone-specific HLAMPs were part of complex genomic structures involving multiple chromosomes. For example, variable amplitude around the CTNNB1 locus in SA1096 coincided with a translocation between chr. 3 and chr. 6 ( Fig. 4j). Long-read single-molecule nanopore sequencing 36  Thus, cell-to-cell variability in HLAMP-which is not observable with bulk sequencing-is a pervasive mutational pattern that is most pronounced in FBI tumours, and consists of clone-specific complex rearrangements that influence phenotype through variable oncogene expression.

Haplotype-specific parallel evolution
We next investigated the extent of haplotype-specific parallel copy number evolution in tumours (Fig. 5). Phylogenetic tree analysis using breakpoints inferred from total copy number across the whole genome 18,37 (see Methods) revealed that in some cases, alleles segregated into distinct clades on the tree; for example, gains of 1q in SA1049 (Fig. 5a) and losses at the terminal end of chr. 10 in SA1053 (Fig. 5b). In other cases, gains and losses of different alleles were sporadic and were distributed more randomly across the tree, such as chr. 8 in SA1093 (Fig. 5c). Parallel copy number events were validated using VAFs of mutations found in these regions, in which-as expected-the VAF distribution inverted between two expected values, depending on allelic composition (Fig. 5d). We contend that in bulk sequencing, represented here by pseudobulk with mixtures (see Methods), the computed VAF no longer reflected the underlying copy number state in a heterogeneous mix of cells (Fig. 5e). We therefore suggest that accurate cancer cell fraction (CCF) inference, which depends on accurate VAF values, may be challenging in tumours with parallel copy number evolution.
We confirmed that parallel CNAs influence transcription with matched scRNA-seq. Inactivation of TP53 is invariably mediated by LOH of chr. 17 in these cancer types, and chr. 17 was indeed mono-allelically expressed across all tumours-in contrast to the hTERT wild-type cell line, which was used here as a control population (Fig. 5i). In addition, genes located at the terminal end of chr. 10 in SA1053 (Fig. 5b), were mono-allelically expressed in 100% of cells, with one cluster of cells expressing the B allele and another group of cells expressing the A allele ( Fig. 5f,g). Across all data with matched scRNA-seq, mean BAF values per segment per sample measured in single-cell DNA sequencing (scDNA-seq) were strongly correlated (R = 0.91, P < 10 −5 ) with those measured in scRNA-seq (Fig. 5h), consistent with allele bias at the DNA level translating to consequent allele bias in expression.
Notably, nearly all tumours exhibited parallel CNA evolution. We classified genomic segments as parallel CNAs if more than 1% of cells had gain or loss of both the A and B alleles and assigned the clonality using total copy number as follows: clonal (CCF > 80%, as in Fig. 5a,b), subclonal (20% < CCF ≤ 80%) or rare (CCF ≤ 20%, as in Fig. 5c). Every tumour had at least one parallel CNA event, with most containing parallel CNAs at different clonalities (Fig. 5j). Across all samples, an average of 6% of clonal segments, 15% of subclonal segments and 7% of rare segments contained parallel CNAs, with a trend for higher event rates in FBI relative to HRD-Dup (Extended Data Fig. 8j,k; P = 0.02 for subclonal, P > 0.05 for clonal and rare). Motivated by the sporadic pattern of losses of both alleles observed in Fig. 5c, we then tested whether parallel CNAs due to losses were more common on a tetraploid versus a diploid background, using ancestral state reconstruction to estimate the event rate across the phylogenetic tree (see Methods). We found that on a diploid (1|1) background, parallel gains were more common than losses, but that on a tetraploid (2|2) background, parallel losses became more common than gains. This was true for whole chromosome, chromosome arm and segmental aneuploidies (Fig. 5j). The number of parallel CNAs was significantly correlated with both copy number and phylogenetic distances computed using total copy number (Fig. 5k,l). Thus, parallel copy number evolution was a pervasive feature, affecting the interpretation of somatic mutations, haplotype-specific expression and overall levels of genomic diversity in TNBC and HGSC tumours.

Increased CNA serriformity in FBI tumours
SSVs first identified in cell lines from single-cell WGS represent a structural mutation type that is not identifiable using bulk WGS, as the cell-to-cell variation in copy number breakpoints is obscured. We analysed the tumour DLP+ data for the presence of SSVs ( Fig. 6a; heat maps 2-5 from left). SSVs, occurring at a megabase length scale, were distinct from small cell-to-cell variations in copy number breakpoint Article localization, which may occur owing to fluctuations in sequencing coverage rather than true changes in copy number (for example, Fig. 6a; invariant copy number heat map boundaries, heat map 1 from left). SSVs were also visible in single cells comprising the serration pattern (Fig. 6b). Additional confirmation of the SSV scale was obtained from allele phasing, in which the concomitant loss of heterozygosity was observed (Fig. 6b, bottom track). We computed serration scores per breakpoint event to identify the relative degree of variation in breakpoints across cells in each cancer. Scores were calculated as the fraction of event-containing cells with rare (less than 5% of cells) event breakpoint positions (see Methods) and with breakpoint regions that met size and prevalence criteria (≥20 Mbp, that is, 40 genomic bins; ≥100 cells with breakpoint event) to permit the detection of positional and cell-to-cell variation. Variable cell-to-cell breakpoints were common, with 6.6% of regions having serration scores of 0.15 or higher (that is, 15% of cells or more have a rare event breakpoint position) across all cases, with FBI cases having the highest (12.1%), HRD-Dup cases the lowest (1.2%), and TD cases intermediate (10.5%) rates. Comparison of distributions of serration using a mixed effects linear model accounting for individual variation indicated that FBI cases had higher degrees of breakpoint variance in breakpoint regions (P = 0.0081; Fig. 6c,d). We further observed that serration scores increased in cases with more polyploid cells (R = 0.68, P = 0.001; Fig. 6e), and as a function of cell-to-cell HSCN distance (R = 0.62, P = 0.0033; Fig. 6f), implicating SSVs as an additional genome-diversifying mechanism in TNBC and HGSC cancers.

Discussion
Our findings show that cell-to-cell variation at the level of structural and copy number alteration is a pervasive 'foreground' feature of TNBC and HGSC cancers that is exhibited against distinct endogenous mutational processes of genomic instability. Because CNAs can influence the expression levels of hundreds of genes, each of the foreground mutational patterns provides extensive and distinct genomic diversity upon which selection may act. Oncogenic HLAMPs are understood to be key drivers of tumour progression and are prognostic in HGSC when co-localized with FBIs 12 . Here we reveal an additional layer of complexity, finding that the amplitude of HLAMPs can vary substantially between cells. Although this has been recognized as a defining feature of extrachromosomal DNA amplifications 38 , we propose that it is also a general property of other classes of HLAMPs, such as those mediated by BFBCs and by complex inter-chromosomal rearrangement processes 39 . This has important implications for therapeutic strategies to target frequently altered oncogenes, as cancer types of high genomic instability may be predisposed to containing treatment-resistant clones. Multi-allelic variation within the same locus is also a highly prevalent feature of breast and ovarian cancers, consistent with some previous observations in other cancers 21,22,30 .  . j, Rate of gains and losses within whole chromosomes (n = 35 events), chromosome arms (n = 31 events) and segments (n = 341 events) on diploid (1|1) and tetraploid (2|2) backgrounds. WGD, whole-genome duplication. k,l, Correlation of the number of parallel copy number events with copy number distance (P = 0.0008) (k) and phylogenetic distance (P = 0.0003) (l). Annotations at the top indicate the correlation coefficient (R) and P value derived from a linear regression; shaded areas in plots show the 95% confidence interval of the linear regression. All box plots indicate the median, first and third quartiles (hinges), and the most extreme data points no farther than 1.5× the IQR from the hinge (whiskers).

Article
Notably, events that appeared clonal at the total copy number level were often composed of distinct clades with different alleles gained or lost; this might reflect evolutionary convergence for favourable karyotypes at the total copy number level, as shown by transcriptional phenotypic convergence. Evolutionary time series modelling 18 is likely to further help to resolve patterns of phenotypic selection from parallel CNAs. We also highlight that sporadic gains and losses happen on both alleles, with rates increased on a whole-genome-doubled background relative to diploid, potentially reflecting increased fitness tolerance owing to genomic redundancy. Finally, megabase-scale copy length variation at a single-cell level (SSV) has been observed in vitro with cell-selected single-cell sequencing 1 . Here we show with single-cell genome sequencing at the cell-population level that SSVs are in fact prevalent in TNBC and HGSC and distribute across clones within tumours. Although the underlying mechanisms that generate SSVs are unknown, they represent a new class of variation that may contribute to the structural copy evolution of tumours enriched in the FBI background and in polyploid genome states. We observed each of the foreground mutational patterns in all mutational processes, but FBI-type tumours showed a significant enrichment in all three foreground patterns. As such, FBI may comprise a distinct phenotypic class in which foreground mutational patterns generate diversity that could underlie poor prognostic significance. We conclude that scaled single-cell sequencing is a useful means to reveal hidden cellular states of structural copy number diversity in genomically unstable tumours. The data that we present here show that foreground   72  56  100  96  92  115  63  53  13  25  77  76  33  43  100  108  37  27  54  58   HRD-Dup  TD  FBI   SA1050  SA1051  SA1052  SA1053  SA501  SA535  SA1181  SA1184  SA1047  SA1093  SA1049  SA1096  SA1162  SA530  SA604  SA609  SA610  SA1182  SA1180  mutational patterns are key determinants of genomically encoded phenotypic diversity and consequent 'evolvability' in cancer.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-022-05249-0. Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Verification of mutations in the 184-hTERT cell line
Genomic DNA was extracted from 184-hTERT cell lines and regions of interest of BRCA1 or BRCA2 were amplified by PCR. Amplicons were inserted into a pCR-TOPO vector and transformed into Escherichia coli using the TOPO TA cloning kit (Thermo Fisher Scientific). Colonies were selected, DNA purified by Purelink Quick Plasmid Miniprep kit (Thermo Fisher Scientific) and sequenced by Sanger sequencing to assess CRISPR-induced mutations.

Acquisition of samples from patients and patient consent
Samples were acquired with informed consent, according to procedures approved by the Ethics Committees at the University of British Columbia. Patients with breast cancer undergoing diagnostic biopsy or surgery were recruited and samples were collected under protocols H06-00289 (BCCA-TTR-BREAST), H11-01887 (Neoadjuvant Xenograft Study), H18-01113 (Large-scale genomic analysis of human tumours) or H20-00170 (Linking clonal genomes to tumour evolution and therapeutics). HGSC samples were obtained from women undergoing debulking surgery under protocols H18-01652 and H18-01090. Banked HGSC and TNBC specimens were obtained at the Memorial Sloan Kettering Cancer Center following Institutional Review Board (IRB) approval and patient informed consent (protocols 15-200 (HGSC) and 18-376 (TNBC)). HGSC and TNBC clinical assignments were performed according to American Society of Clinical Oncology guidelines for ER, PR and HER2 positivity.

Xenografting
Fragments of tumours from patients were chopped finely with scalpels and mechanically disaggregated for one minute using a Stomacher 80 Biomaster (Seward Limited) in 1 ml cold DMEM/F-12 with glucose, l-glutamine and HEPES (Lonza 12-719F). Two hundred microlitres of medium containing cells or organoids from the resulting suspension was used equally for transplantation in four mice. The remaining tissue fragments were cryopreserved viably in DMEM/F-12 supplemented with 47% FBS and 6% dimethyl sulfoxide (DMSO). Tumours were transplanted in mice as previously described (Eirew) in accordance with SOP BCCRC 009. Female NOD/SCID/IL2rγ −/− (NSG) and NOD/ Rag1 −/− Il2rγ −/− (NRG) mice were bred and housed at the Animal Resource Centre (ARC) at the British Columbia Cancer Research Centre. For subcutaneous transplants, mechanically disaggregated cells and clumps of cells were resuspended in 150-200 µl of a 1:1 v/v mixture of cold DMEM/F-12:Matrigel (BD Biosciences). Female mice (8-12 weeks old) were anaesthetized with isoflurane and the mechanically disaggregated cell clump suspension was transplanted under the skin on the left flank using a 1-ml syringe and 21-gauge needle. Mice were housed at a 18-25 °C temperature range and 20-70% humidity range, with a 12-h daylight cycle (on at 06:00; off at 18:00). All animal experimental work was approved by the animal care committee (ACC) and animal welfare and ethical review committee at the University of British Columbia (UBC) under protocol A19-0298.

Tissue processing
Xenograft-bearing mice were euthanized when the size of the tumours approached 1,000 mm 3 in volume, according to the limits of the experimental protocol. The tumour material was excised aseptically and processed as described for primary tumour. A section of tumour was fixed in 10% buffered formalin for 24 h, dehydrated in 70% ethanol and paraffin-embedded before duplicate 1-mm cores were used to generate tissue microarrays for staining and pathological review. Remaining tumour was finely chopped and gently paddle-blended, and released single cells and fragments were viably frozen in DMEM supplemented with 47% FBS and 6% DMSO.

Histopathology of PDX tumours
Deparaffinized 4-µm sections of tissue microarrays were stained with haematoxylin and eosin or KRAS (Lifespan Bioscience, LS-B4683, 1:50), performed using the Ventana Discovery XT platform and the UltraMap DAB detection kit. HGSC pathology was confirmed by an anatomical pathology resident at University of British Columbia, under the supervision of a certified staff pathologist.

WGS
Genomic DNA was extracted from frozen tissue fragments using the DNeasy Blood and Tissue kit (Qiagen) and constructed libraries for whole genomes of 309 tumour-normal pairs were sequenced on the Illumina HiSeqX according to Illumina protocols, generating 100-bp paired-end reads for an estimated coverage of sequencing between 40× (normal) and 80× (tumour). Sequenced reads were aligned to the human reference GRCh37 (hg19) using BWA-MEM.

Long-read sequencing
High-molecular weight (HMW) DNA was extracted from fresh and/or frozen tissue fragments using the MagAttract HMW DNA Kit (Qiagen) and size-selected using Blue Pippin for single long-molecule sequencing on the PromethION (Oxford Nanopore Technologies).

Generation of single-cell suspensions and nuclei for scDNA-seq
Viably frozen aliquots of patient tissues and PDX tumours were thawed and either homogenized and lysed using Nuclei EZ Buffer (Sigma) or enzymatically dissociated using a collagenase/hyaluronidase 1:10 (10×) enzyme mix (STEMCELL Technologies), as described previously 3,18 . Cells and nuclei were stained with CellTrace CFSE (Life Technologies) and LIVE/DEAD Fixable Red Dead Cell Stain (Thermo Fisher Scientific) in a 0.04% BSA/PBS (Miltenyi Biotec 130-091-376) incubated at 37 °C for 20 min. Cells were pelleted and resuspended in 0.04% BSA/PBS. This single-cell suspension was loaded into a contactless piezoelectric dispenser (Cellenone or sciFLEXARRAYER S3, Scienion) and spotted into open nanowell arrays (SmartChip, TakaraBio) preprinted with unique dual index sequencing primer pairs. Occupancy and cell state were confirmed by fluorescent imaging and wells were selected for single-cell copy number profiling using the DLP+ method 3 . In brief, cell dispensing was followed by enzymatic and heat lysis. After cell lysis, tagmentation mix (14.335 nl TD buffer, 3.5 nl TDE1 and 0.165 nl 10% Tween-20) in PCR water was dispensed into each well followed by incubation and neutralization. For BRCA1 +/− cells, the tagmentation mix consisted of 10 nl TB1 buffer and 10 nl BT1 enzyme without Tween-20 in PCR water. Final recovery and purification of single-cell libraries was done after eight cycles of PCR. Pooled single-cell libraries were analysed using the Agilent Bioanalyzer 2100 HS kit. Libraries were sequenced at the UBC Biomedical Research Centre on the Illumina NextSeq 550 (mid-or high-output, paired-end 150-bp reads), or at the Genome Sciences Centre on the Illumina HiSeq2500 (paired-end 125-bp reads) and Illumina HiSeqX (paired-end 150-bp reads). The data were then processed through a quantification and statistical analysis pipeline 3 .

Generation of 10X scRNA-seq data
The 184-hTERT cells were pelleted and gently resuspended in 200 µl PBS followed by 800 µl 100% methanol and incubation at −20 °C for 30 min to fix, dehydrate and shrink cells. PDX tumour fragments were dissociated into single cells using collagenase/hyaluronidase at 37 °C for 2 h for TNBC tumours or with cold active Bacillus lichenformis (Creative Enzymes NATE0633) in PBS supplemented with 5 mM CaCl 2 and 125 U ml −1 DNAse for HGSC tumours, as described previously 45 with additional mechanical dissociation using a gentleMACS dissociator (Miltenyi Biotec). Cells were then pelleted and resuspended in 0.04% BSA/PBS and immediately loaded onto a 10X Genomics Chromium single-cell controller targeting 3,000 cells for recovery. Libraries were prepared according to the 10X Genomics Single Cell 3′ Reagent kit standard protocol. Libraries were then sequenced on an Illumina Next-seq500/550 with 42-bp paired-end reads, or a HiSeq2500 v4 with 125-bp paired-end reads. 10X Genomics Cell Ranger 3.0.2 was used to perform demultiplexing, counting and alignment to GRCh38 and mm10.
Processing of bulk whole-genome data SNV and SV calls for 121 HGSC samples were acquired from a previous study 12 . For new samples, reads were aligned to the hg19 reference genomes using BWA-MEM. Processing proceeded as per the aforementioned study 12 to maintain consistency.
SNVs were called with MutationSeq 46 (probability threshold = 0.9) and Strelka 47 . The intersection of calls from these methods were retained; however, SNVs falling in blacklist regions were removed. The blacklist regions include the UCSC Genome Browser Duke and DAC blacklists, and those in the CRG Alignability 36mer track that had more than two mismatched nucleotides. SNVs were then annotated with OncoKB 48 for variant impact.
SVs were called using deStruct 49 and LUMPY 50 , and breakpoints called by both methods were retained. We then filtered events with the following criteria: any breakpoints falling in the blacklists described above; ≤30-bp inter-breakpoint distance; <1,000-bp deletion; any breakpoints with fewer than 5 supporting reads in the tumour sample or any read support in the matched normal sample.
Gene mutation enrichment analysis was performed using the hypergeometric test for SNVs, amplifications and deep deletions separately, comparing each signature stratum to all other samples.

Nanopore data analysis
For Nanopore sequence data, base calling and read alignment were performed using Guppy v.3 and Minimap2, respectively 51,52 . Reads that were likely to be derived from mouse were filtered by first aligning to a concatenated hg19 and mm10 reference, removing reads with alignments to mm10 and re-aligning the remaining reads to hg19. Signal artefact regions as well as alignments with mapping quality of less than 60 were excluded from the final alignments. Alignments were then phased using the PEPPER-Margin-DeepVariant pipeline, after which WhatsHap was used to tag reads in the filtered alignments using phasing information 53,54 . SV calling was performed using Sniffles (v.1.0.12) and cuteSV (v.1.0.11) with 5 read support, and subsequently merged using SURVIVOR for a union set of predicted variants 55,56 . Alignments and variants were visualized using IGV, Ribbon and the karyoploteR R package [57][58][59] .

HGSC and TNBC meta-cohort signature analysis
Signature analysis was performed according to a previous study 10 . The MMCTM model was run on the sample SNV and SV count matrices. The number of signatures to estimate in the HGSC and TNBC integrated cohort was chosen by running the above fitting procedure for k = 2-16 for both SNV and SV signatures with the number of restarts set to 500, in which k is the number of signatures. We performed this step on approximately half the mutations in each sample, then computed the average per-mutation log likelihood on the other held-out half of the mutations. The elbow curve method on log-likelihood values was used to select the final number of signatures to fit to the entire dataset.
To estimate MMCTM parameters on the full dataset, α hyperparameters were set to 0.1. The model was initially fit to the data 1,500 times. Each restart was run for a maximum of 1,000 iterations or until the relative difference in predictive log likelihood on the training data was less than 10 −4 between iterations. The restarts with the best predictive log likelihoods for SNVs and SVs were selected as seeds for the final fitting step. The model was again fit to the data 1,500 times. The model parameters for each restart were set to the parameters of the optimal models from the previous step described above, then run for a maximum of 1,000 iterations or until the relative difference in predictive log likelihood on the training data was less than 10 −5 between iterations. The restart with the best mean rank of the SNV and SV predictive log likelihoods from this round was selected as the final model. MMCTM estimated SNV signatures were matched to COSMIC signatures by solving the linear sum assignment problem for cosine distances between the MMCTM and COSMIC signatures v.3 (minus tobacco smoking-associated COSMIC SBS4) using the clue R package 60 .
Samples were clustered by first applying UMAP 61 to the normalized signature probabilities for the HRD SNV signature and all SV signatures with n_neighbors = 20 and min_dist = 0 to produce two-dimensional sample embeddings. Next, HDBSCAN 62 was run on the sample embeddings with min_samples = 5, min_cluster_size = 5 and cluster_selection_ epsilon = 0.75 to produce the sample clusters (strata).

Survival analysis
For each patient, the number of days between diagnosis and death or last follow-up was collected. Patients were segregated into groups, and a Kaplan-Meier curve was fitted for each group. Each cancer type was analysed separately and in two distinct grouping schemes. First, patients were split into HRD and 'Other' groups, in which the HRD group included patients whose cancers were identified as being in either the HRD-Dup or HRD-Del groups, and the 'Other' group included all other patients. Next, patients were grouped according to their assigned signature types: HRD-Dup, HRD-Del, TD or FBI.

DLP+ WGS quantification and analysis
Single-cell copy number, SNV and SV calls were generated using a previously described approach 3 , except that BWA-MEM 63 was used to align DLP+ reads to the hg19 reference genome. The genome was segregated into 500-kb bins, and GC-corrected read counts were calculated for each bin. These read counts were then input into HMMCopy 64 to produce integer copy number states for each bin.

DLP+ data filtering
Cells were retained for further analysis if the cell quality was at least 0.75 (ref. 3 ), and they passed both the S-phase and the contamination filters. The contamination filter uses FastQ Screen 65 to tag reads as matching human, mouse or salmon genomes. If more than 5% of reads in a cell are tagged as matching the mouse or salmon genomes, then the cell is flagged as contaminated. The S-phase filter uses the cell-cycle state Random Forest classifier from ref. 3 and removes cells for which S-phase is the most probable state. The HGSC and TNBC cells were also filtered to remove small numbers of contaminating diploid cells.
Finally, cell filtering was performed to remove putative early and late S-phase cells that passed the initial S-phase filter. This involved two steps: first, building a cell phylogeny with sitka 37 and manually identifying the minimal phylogeny branches in which the cycling cells have been clustered. The cells in these branches were then removed. Next, clustering cells according to their copy number profiles and removing manually identified clusters of cycling cells.
We removed potentially problematic genome bins from our copy number results that had a mappability score of 0.99 or below, or that were contained in the ENCODE hg19 blacklist 66 .
To detect SNVs and SVs in each dataset, reads from all cells in a DLP+ library were merged to form 'pseudobulk' libraries. SNV calling was performed on these libraries individually using MutationSeq 46 (probability threshold = 0.9) and Strelka (score > 20) (ref. 47 ). Only SNVs that were detected by both methods were retained. For each dataset, the union of SNVs was aggregated, then for each cell and each SNV, the sequencing reads of that cell were searched for evidence of that SNV. SV calling was performed in a similar manner, by forming pseudobulk libraries, then running LUMPY 50 and deStruct 49 on each pseudobulk library, and retaining events that were detected by both methods. LUMPY and deStruct predictions were considered matched if the breakpoints matched in orientation and the positions involved were each no more than 200 nucleotides apart on the genome. Only deStruct predictions with a matching LUMPY prediction were retained. Sparse per-cell breakpoint read counts were extracted from deStruct using the cell identity of read evidence for each predicted breakpoint. SNV and SV calls were further post-processed according to a previous study 12 . When performing pseudobulk analysis on groups of cells, a breakpoint is considered present in a clone if at least one cell that constitutes the clone contains evidence of the breakpoint. A subsampling experiment determined that this approach has 80% power to recover breakpoints at a cumulative coverage of 5× (100-150 cells) (see Supplementary Note).

Analysis of mutation signatures in DLP+ data
Mutation signature probabilities were fit to DLP+ pseudobulk-derived SNV and SV counts for each patient using the MMCTM method and pre-computed mutation signatures from the HGSC and TNBC meta-cohort. Inference was performed as per the bulk sequencing data, until the relative difference in predictive log likelihood was < 10 −6 between iterations.

Identifying clones in DLP+ WGS by clustering copy number profiles
For most datasets, clones were detected by first using UMAP on per-cell GC-corrected read count profiles, producing a two-dimensional embedding of the cell profiles. We then ran HDBSCAN on the two-dimensional embedding from UMAP to detect clusters of cells with similar copy number profiles.

Calculating cell ploidy
Cell ploidy was calculated by taking the most common copy number state. Copy number states were those determined by HMMCopy.

Identifying missegregated chromosomes
The approach taken to identify putative chromosome missegregation events is similar to a previous one 3 . Cells were split into groups corresponding to their clones. Clone copy number profiles were generated for each clone. Cells with ploidy not equal to the clone consensus profile were normalized to match the clone ploidy. Cell copy number profiles were compared to the clone copy number profile for the matching clone to which the cell belongs. The result was assignment of an offset value for each genomic bin in each cell, that represented the copy number difference between the cell and the clone-level consensus profile. For each chromosome in each cell, if a particular copy number difference (that is, −1, 1, and so on) represented at least 75% of the chromosome, then we labelled that chromosome as having a missegregation event.

Identifying CNA segments
Gain and loss segments in each cell were found by comparing the copy number state in each 500-kb bin to that cell's ploidy. A copy number higher than ploidy was labelled as a gain, and a copy number lower than ploidy was labelled as a loss. Gain and loss segments are a set of consecutive bins with the same gain-loss label. Segments ≤ 1.5 × 10 6 bp were excluded to reduce segments potentially resulting from noise in the HMMCopy copy number states.

Computing serriform variability scores in CNA breakpoints
For each dataset, consensus copy number profiles were generated for each clone. Copy number segments were identified as above for each consensus profile. Copy number segments were then identified for single-cell copy number profiles. The copy number profiles of each cell were normalized so that the adjusted cell ploidy matched the ploidy of the clone to which the cell belonged using the following formula: cell_state = cell_state/cell_ploidy × clone_ploidy Cell copy number segments were matched to segments in the clone copy number profile as follows: for each segment in the clone copy number profile, inspect the copy number states of the adjacent segments. If the segment state was less than both adjacent states, then only cell segments whose state was less than both of the two adjacent clone segment states could be matched to that segment. If the clone segment state was higher than both adjacent states, only cell segments whose state was higher than both of the adjacent clone segment states could be matched to that segment. If the clone segment state was in between the two adjacent states, only cell segments whose state was in between the two adjacent clone segment states could be matched to that segment. Finally, each cell segment was matched to the compatible clone segment that it overlapped the most, in which compatibility means that the cell segment state met the criteria described above and the cell belonged to the relevant clone.
Next, clone segment breakpoints were aggregated across all clones. For each breakpoint, matched cell breakpoints were identified. Stable cell breakpoints (that is, those cell breakpoints that matched the clone-level breakpoint position) and unstable cell breakpoints (all other cell breakpoints) were queried for their raw GC-corrected read count values up to five bins to the left and right of the breakpoint position. Breakpoint noise values were computed as the mean absolute value of the difference between these values and the integer copy number state inferred by HMMCopy. For each clone-level breakpoint event, cells were removed if their breakpoint noise values were higher than a threshold value, which was computed as the mean noise value of the stable cell breakpoints.
Serration scores for each event were calculated by first computing the frequency of cell-specific breakpoint positions. Each cell breakpoint position was considered 'rare' if it occurred in less than 5% of cells with the considered event. The final serration score was computed as the fraction of event cells whose breakpoint position was considered 'rare'.
For comparing serration rates between cases, breakpoints with at least 100 cells, and whose adjacent copy number segments were in total at least 20 Mbp (40 genomic bins) were retained. This was done to retain only those breakpoints for which serration could be reliably computed. As a result, SA605 was not included in comparisons as this case had fewer than 100 cells. A zero-inflated generalized linear mixed model with a beta response that accounted for case-specific and signature-type effects was fit to determine the effect of mutation signature type on serration scores.

Comparison of HLAMP copy number variance
HLAMPs were identified by first selecting 500-kb genomic bins in which at least 10 cells have a raw copy number (adjusted per-bin read counts) of at least 10. Copy number variance for each bin was calculated using the raw copy number that was adjusted for cell ploidy and cell clone by first dividing the copy number by the cell ploidy, then subtracting the mean clone raw copy number. The cell ploidy is the most common HMMCopy copy number state as described above, and the mean clone copy number is computed for each bin in each clone across all cells in that clone. Mean HLAMP copy number variance was calculated for each dataset across all HLAMP bins, and these values were compared between signature type dataset groups.

Clustering HLAMP genomic features
To explore plausible mechanistic origins of oncogenic HLAMPs we extracted genomic features proximal to the locus of interest. We took a region 15 Mbp either side of the locus of interest and pulled out copy number and SV features. We extracted the following features: entropy of haplotype-specific states; total number of SVs identified; proportion of SVs of each type (fold-back inversions, duplications, deletions and translocations); number of chromosomes involved in translocations; ratio of copy numbers between the bin containing the oncogene and the average copy number across the chromosome; average copy number state; average size of segments; average number of segments; and average minor allele copy number. All averages are across cells. We then performed hierarchical clustering on a scaled matrix of all features, using the silhouette width to determine the appropriate number of clusters.

HSCN analysis
See the Supplementary Note for a detailed discussion of our method, SIGNALS, for HSCN analysis. This includes validation of the method and benchmarking against other methods. In brief, SIGNALS uses haplotype blocks genotyped in single cells and implements an hidden Markov model (HMM) based on a Beta-Binomial likelihood to infer the most probable haplotype-specific state. We used default parameters for all datasets apart from SA1292, in which we increased the self transition probability from 0.95 to 0.999 to mitigate against the noisier copy number data in this sample.

Pseudobulk HSCN profiles
In numerous places in this study we construct 'pseudobulk' haplotype-specific or total copy number profiles either across all cells in a sample or subsets of cells that share some features of interest. To do this, we group the cells of interest and then compute an average profile by taking the median values of copy number and BAF and the mode of the haplotype-specific state. The function 'consensuscopynumber' provided in SIGNALS was used for this.

Comparing segmentation profiles across cells
To facilitate comparisons of genomic profiles across cells, we inferred a set of disjoint segments from the consensus copy number profiles of clusters. For each clone or cluster we generated a consensus segmentation profile, and then used the 'disjoin_ranges' function from plyranges 67 to generate a non-overlapping disjoint segmentation profile. Each segment was then genotyped in each cell by taking a consensus across the bins within each segment, producing a consistent set of genomic segments and states that could be compared across cells.

Identification of parallel copy number events
The set of genotyped disjoint segmentation profiles was used to calculate the number of parallel copy number CNAs. Parallel CNAs were defined as genomic regions greater than 4 Mbp in which gain or loss of both the maternal and paternal haplotype was observed in more than 1% of cells. Copy number breakpoints of segments do not need to match to be included.

Phylogenetic analysis
We computed phylogenetic trees using sitka as previously described 37 , using the consensus tree from the posterior distribution for downstream analysis. For visualization, clades with a high fraction of singletons (nodes with a single descendant) were removed. To remove nodes, nodes were ordered by the fraction of descendants that were singletons, and nodes were removed iteratively until a maximum of 3% of cells in the tree were removed. Trees were visualized using ggtree 68 and functionality in SIGNALS. Phylogenetic distances were computed as the mean pairwise distances between phylogenetic tips (cells) using the cophenetic function in APE 69 . Distances represent the number of copy number change points between two cells on the phylogeny.

Event rates inferred from single-cell phylogenies
To compute the rates of gains and losses of whole chromosomes, chromosome arms and segmental aneuploidies we enumerated the number of events from our single-cell phylogenies using parsimony-based ancestral state reconstruction. We used the genotyped disjoint segmentation profiles for this.
We first defined states for each segment in each cell relative to the most common state across all cells. For each segment, cells can have one of two possible states for each class of interest: (gain, not gained), (loss, not lost). By casting the problem as reconstructing the ancestral states within the phylogeny, we can then compute the number of transitions between these states that most parsimoniously explains the phylogenetic tree. We used a simple transition matrix in which transitions between states incur a cost of 1. Ancestral state reconstruction then amounts to finding the reconstruction that minimizes this cost. The event frequency per sample is then calculated by dividing the parsimony score (number of events) by the number of cells. We used castor v.1.6.6 in R to perform the ancestral state reconstruction 70 . The unit of this quantity is the number of events per cell division, assuming no cell death. It is possible (perhaps likely) that many cells get segmental gains or losses but then die, we never sample such cells and our phylogenetic tree reconstructs ancestral relationships between cells that survive and that we sample. It is challenging to decouple the death rate of cells from the true event rate per cell division 71 ; thus, our event rate is an effective event rate; that is, the event rate scaled by the (unknown) death rate of cells. To contrast the rates across different types of events, we classified segments as whole chromosomes, chromosome arms or segmental aneuploidies.

Calculation of copy number distance
The copy number distance calculates the number segments that need to be modified to transform one copy number profile into another 20 . We use this measure to compute cell-to-cell variation in our dataset. To compute this measure, we modified the code provided in a previous study 25 to take into account whole-genome doubling of cells (https:// github.com/raphael-group/WCND). We did this as follows: given two copy number profiles (integer copy number states of individual haplotypes in bins across the genome) CNP A and CNP B , we computed the following distances: in which 2× refers to doubling the copy number state across the whole genome. We then took the copy number distance to be If the minimum was d 2 or d 3 , we increased d by 1 (that is, counting WGD as an additional event). Calculating all pairwise comparisons is computationally expensive, so for each dataset we subsampled 250 cells and calculated all pairwise distances for these 250 cells.
10X scRNA-seq processing CellRanger software (v.3.1.0) was used to perform read alignment, barcode filtering and quantification of unique molecular identifiers (UMIs) using the 10X GRCh38 transcriptome (v.3.0.0) for FASTQ inputs. CellRanger filtered matrices were loaded into individual Seurat objects using the Seurat R package (v.4.1.0) (refs. 72,73 ). The resulting gene-by-cell matrix was normalized and scaled for each sample. Cells retained for analysis had a minimum of 500 expressed genes and 1,000 UMI counts and less than 25% mitochondrial gene expression. Genes expressed in fewer than three cells were removed. Cell-cycle phase was assigned using the Seurat 73 CellCycleScoring function. Scrublet 74 (v.0.2.3) was used to calculate and filter cells predicted to be doublets. We then applied the standard Seurat processing pipeline using default parameters apart from using the first 20 principal component analysis (PCA) dimensions for nearest neighbour and UMAP calculations.

Allelic imbalance in scRNA-seq
We called heterozygous SNPs in the scRNA-seq data using cellSNP v.1.2.2 (ref. 75 ). As input, we used the same set of heterozygous SNPs identified in the scDNA-seq and the corresponding normal sample for each sample. The liftover script provided in cellSNP was used to lift over SNPs from hg19 to hg38. After genotyping, we phased the SNPs using the phasing information computed from the haplotype-specific inference in the scDNA-seq. As SNP counts are much more sparse in scRNA-seq than in scDNA-seq (around two orders of magnitude lower), we aggregated counts across segments (minimum size = 10 Mbp), computing the BAF for each segment. We then generated a cell by segment BAF matrix and incorporated this into our gene expression Seurat objects. We applied an additional filtering criterion here, removing cells with fewer than 200 SNP counts. Functionality to map scDNA-seq to scRNA-seq and call allelic imbalance is provided in SIGNALS.

Differential expression analysis
Differential expression analysis between gene expression clusters was computed using the Wilcoxon rank sum test with the presto R package. Gene expression clusters were computed using the FindClusters function in Seurat. Only cells in G1 phase were included. To compare gene expression variability for oncogenes, we took the absolute maximum log-transformed fold change for each sample for each oncogene and contrasted this value in cases in which oncogene copy number was determined to be fixed or variable from DLP+ single-cell sequencing of the same samples. 'Variable' oncogenes were defined as those that had a minimum ratio of 2 between the maximum to minimum clone-level copy number, and 'non-variable' oncogenes as those that had a ratio of less than 2.

Nearest neighbour gene expression analysis
To assess transcriptional convergence of losses of alleles we made use of the shared nearest neighbour graph computed using Seurat. This was done for chr. 2q in sample SA906b. For a given cell, an enrichment score was defined as the observed fraction of nearest neighbours divided by the expected fraction of nearest neighbours. Here, the expected fraction of neighbours with the same allelic state was defined as the global fraction of cells in each state. Hence, a positive enrichment score indicates an overrepresentation of cells in the allelic state of interest among its nearest neighbours, a negative score indicates an underrepresentation and a score of 0 would reflect a perfectly mixed neighbourhood of cells with different allelic states. To mitigate the influence of other technical or biological variability, for this analysis we only included cells in G1 phase, and removed cells with greater than 7.5% mitochondrial gene expression as we found that this was variable between gene expression clusters.

Statistical tests
The statistical tests used were two-tailed unequal-variance t-tests unless otherwise specified: log-rank tests were used for comparing survival curves; Wilcoxon rank sum two-tailed tests were used for comparing segment lengths, segment counts, missegregations and ploidy percentages, copy variances, bin counts, gene copy number distributions, gene expression log-transformed fold changes, parallel copy number counts and breakpoint counts; and hypergeometric tests were used to identify enrichment of gene mutations. P values from multiple comparisons were corrected using the Benjamini-Hochberg method 76 .

Box plot statistics
All box plots indicate the median, first and third quartiles (hinges), and the most extreme data points no farther than 1.5× the IQR from the hinge (whiskers).

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data are available for general research use. Processed data including somatic mutation data for bulk WGS, total (and allele-specific) copy number profiles for DLP+ data and filtered count matrices for scRNA-seq data are available for download at https://zenodo.org/ record/6998936. Raw scRNA-seq data are available for download at https://ega-archive.org/studies/EGAS00001006343. Raw single-cell sequencing data generated for this study are available from https:// ega-archive.org/studies/EGAS00001006343, and previously published single-cell sequencing data used in this study are available at https:// ega-archive.org/studies/EGAS00001004448 and https://ega-archive. org/studies/EGAS00001003190. Somatic mutation calls from bulk WGS for 16 patients with TNBC for whom the IRB consent does not include public deposition of raw sequencing data are available at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_ id=phs003038.v1.p1, and raw sequencing data can be provided upon request under material transfer agreement to shahs3@mskcc.org. Bulk WGS BAM files from patients under IRB consent protocols for public release of raw data are available for download at https://ega-archive. org/studies/EGAS00001006343, http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs003036.v1.p1 and https:// ega-archive.org/datasets/EGAD00001003268 (for previously published data 12 ) or by request under material transfer agreement to shahs3@mskcc.org and saparicio@bccrc.ca. Single-cell and long-read sequencing (c) was used to examine mutational processes and haplotype-specific genomic diversity, including HLAMPs or rearrangements d), parallel events e) and SSVs f) at single-cell resolution and within clonal and subclonal populations. Generated using Biorender.com.  SA1050  SA1051  SA1052  SA1053  SA1181  SA1184  SA501  SA535  SA1047  SA1093  SA605  DG1197  SA1049  SA1096  SA1162  SA1182  DG1134  SA1180  SA1091  SA530  SA604  SA609   Chr. 3 (Mb)

Extended Data Fig. 5 | Haplotype-specific analysis reveals breakagefusion-bridge processes and parallel losses. a) Diagram of BFBCs b)
Heat maps of the copy number of each homologue in SA1188. c-f) HSCN and structural variation in clusters B, I, F and the small subpopulation with PIK3CA amplification. Here we plot the copy number for each homologous chromosome in purple for homologue B and green for homologue A. g) Parallel copy number losses in SA906b: total copy number (top) and HSCN (bottom) heat maps for chr2 in SA906b h) two individual cells from g). i) UMAP dimensionality reduction plots of scRNA-seq data generated from SA906b, colours indicate the density of loss of chr 2q A vs. B haplotype. j) Enrichment of the haplotype-specific state on chr 2q of nearest neighbour cells (# cells with loss of A = 175, # of cells with loss of B = 34, # Balanced = 2066). All box plots indicate the median, 1 st and 3 rd quartiles (hinges), and the most extreme data points no farther than 1.5x the IQR from the hinge (whiskers). HRD and e) more granular signatures (p-values computed using the log-rank test, p = 0.0038 for d) and p = 0.0022 for e)). All box plots indicate the median, 1 st and 3 rd quartiles (hinges), and the most extreme data points no farther than 1.5x the IQR from the hinge (whiskers). Fig. 8  a Extended Data Fig. 9 | See next page for caption.