DEVOLUTION—A method for phylogenetic reconstruction of aneuploid cancers based on multiregional genotyping data

Phylogenetic reconstruction of cancer cell populations remains challenging. There is a particular lack of tools that deconvolve clones based on copy number aberration analyses of multiple tumor biopsies separated in time and space from the same patient. This has hampered investigations of tumors rich in aneuploidy but with few point mutations, as in many childhood cancers and high-risk adult cancers. Here, we present DEVOLUTION, an algorithm for subclonal deconvolution followed by phylogenetic reconstruction from bulk genotyping data. It integrates copy number and sequencing information across multiple tumor regions throughout the inference process, provided that the mutated clone fraction for each mutation is known. We validate DEVOLUTION on data from 56 pediatric tumors comprising 253 tumor biopsies and show robust performance on simulations of bulk genotyping data. We also benchmark DEVOLUTION against similar bioinformatic tools using an external dataset. DEVOLUTION holds the potential to facilitate insights into tumor development, progression, and response to treatment, particularly in tumors with a high burden of chromosomal copy number alterations.

*Focuses exclusively on SNVs in copy-number neutral, loss of heterozygosity (LOH)-free portions of the genome.
*Does not compare information between samples during the inference procedure. **Limited to WGS data.
*Limited to WGS data.
-Can read SNP-array data directly and/or sequence data to compute the number of clones, although no more than five, and is limited to one single sample.
-Does not infer the order of events.
-Does not integrate information from multiple samples.
-Uses SNP-array data or sequencing data to infer the "CCF" (cancer cell fraction = cellular prevalence of SNVs in this paper). They do consider the order of the SNV relative to the CNA.
-Input data: MCFs inferred from SNP-array, WGS, WES and/or TDS in unison or separately.
-There is no limit to the number of samples that can be analysed.
-SNP-array data can be analysed alone.
-Information about point mutations can be added to the segment file to be integrated in the process.
-Subclones are inferred and their distribution across biopsies visualized in a phylogenetic tree.

... pies. An asterisk after the patient name indicates that the MP- and ML-trees for this tumor are identical. When they are not identical, both are displayed. The endpoints of the trees represent cell populations harboring distinct genomic profiles (subclones), whose fractions across samples are visualized as pie charts. Biopsies are available from the primary tumor before treatment (B) and after treatment (P), relapses (R), distant metastases (M), lymph node metastases (LN) and lung metastases (LU). The segment files used to produce the phylogenies can be found in Supplementary data 1 and the corresponding event matrices produced by DEVOLUTION in Supplementary data 2.

Supplementary Figure 9 (RMS7) Contradictions in a complex dataset. The user might want to reconsider whether one of the events in cluster J actually is D (*). The tables illustrate the alterations in RMS7 found in the biopsy before treatment (B), after treatment (P), a lymph node metastasis (M-LN), and a lung metastasis (M-LU). The first column is the location and type of alteration, the second column its size in the biopsy and the third its clustering. The ovals in the lower part of the figure illustrate the suggested temporal allocation of the clusters of events based on their sizes across the biopsies. In this case all events are > 50 %, resulting in only one solution to how the clones are nested. Below each oval the temporal allocation is indicated. In biopsy B the alteration in cluster B exists in a cell having the alterations contained by clusters A and C. In P it exists in a cell having the alterations contained by clusters D, E, F and G. From these two biopsies from the primary tumor it thus seems as if B is a very early event in the evolution of the tumor. When looking at the lymph node biopsy we find cells having the alterations contained by D, E, H and I without B. This indicates that B is lost, which is unlikely since it is an intrachromosomal aberration. In addition, one of the alterations in cluster J might actually represent the same genetic alteration as the one forming cluster D in the other samples. It is considered to be two different events in this example since one of the end points differs by more than 1 Mbp. Reconsidering these two alterations results in a tree without contradictions in the evolutionary history of this rhabdomyosarcoma.

Supplementary Figure 14 TRACERx analysis using DEVOLUTION and MAGOS. a-d) Result of the analysis of the 20 NSCLCs from the TRACERx data set. The leftmost scatterplot illustrates the VAF distribution of the events that passed quality control across the biopsies. The corresponding tumor name is given above the plot. An asterisk indicates that the MP- and ML-trees were identical. When they are not identical, both are displayed, with the upper row showing the MP-trees and the bottom row the ML-trees. The leftmost phylogenetic tree is based on manual nesting of the clusters obtained using MAGOS. The middle phylogenetic tree is the output of DEVOLUTION on the events feasible for analysis with MAGOS, i.e. solely events present in all biopsies. The rightmost phylogenetic tree is the output of DEVOLUTION based on all mutations that passed quality control, i.e. also events that are found in merely a subset of samples. In CRUK0068 the MAGOS clustering results in 5 clusters, denoted cluster 2-6. The clusters cross each other in size across the biopsies, which makes it infeasible to construct a nesting that is concordant across all biopsies. The information used to produce the phylogenies can be found in Supplementary data 6.

Supplementary Figure 16 TRACERx analysis using DEVOLUTION and MAGOS. a-e) Result of the analysis of the 20 NSCLCs from the TRACERx data set. The leftmost scatterplot illustrates the VAF distribution of the events that passed quality control across the biopsies. The corresponding tumor name is given above the plot. An asterisk indicates that the MP- and ML-trees were identical. When they are not identical, both are displayed, with the upper row showing the MP-trees and the bottom row the ML-trees. The leftmost phylogenetic tree is based on manual nesting of the clusters obtained using MAGOS. The middle phylogenetic tree is the output of DEVOLUTION on the events feasible for analysis with MAGOS, i.e. solely events present in all biopsies. The rightmost phylogenetic tree is the output of DEVOLUTION based on all mutations that passed quality control, i.e. also events that are found in merely a subset of samples. The information used to produce the phylogenies can be found in Supplementary data 6.
Panel labels: DEV on MAGOS events; DEV on all events.

Supplementary Figure. Algorithm for making sure all clusters are allocated in the correct order in all samples.

Definitions:
-space: The first column is the cluster name and the second column is the space still available at this level.
-totalspace: All final space for the clusters across samples.
-equalclones: Clusters placed descendant to another cluster that is of the same size.
-possible_mothers: A matrix representing possible mother clusters for each cluster.
-allocation_samples: The chosen mother clone for each cluster across biopsies.

Is there a rule for the allocation of this cluster, either given by the user, because it is the only solution possible for the allocation to work across samples, or because it has been specifically chosen by the algorithm above?

-Select a biopsy.
-Double check that the mother clone exists in the sample and that there is enough room for the daughter to be placed in it.
-Extract the allocation it has to have.
-Is the mother or the daughter in equalclones? Possible outcomes: the mother but not the daughter, the mother and the daughter, the daughter but not the mother, or none of them.
-Remove the daughter's space from all clusters in equalclones, including the mother cluster.
-If the daughter's and the mother's spaces are equal in size, the daughter should be added to equalclones for this group.
-Are they of the same size?
-Subtract the space of the daughter from all clusters in equalclones with the mother.
-Remove no spaces. They are simply placed together. Make sure the daughter clone actually is present in equalclones together with its mother clone in this case.
-Remove the daughter's space from the mother's space.
-Are they of the same size? Remove no spaces. They are simply placed together. Make sure the mother clone is added to the equalclones column in which the daughter is.
-Are they of the same size? Add both of them to the equalclones matrix. Do not remove any spaces.
-Choose the second most prevalent possible mother. Otherwise choose the largest space.
-Does the mother in turn have a conditioned mother? Has it been allocated? (One possible answer: yes, but it is not allocated yet.) The previously declared could be smaller in size. Switch places in that case.
-Allocate these clusters so the order is correct.
-Allocate the mother to the conditioned place, or allocate the mother to the largest space.
-Choose the preferred allocation by the algorithm above. If no such is available, choose the largest space.
-Add information in a matrix about the cases where it could not be allocated correctly.
-Loop conditions: there are still clusters to be allocated in this biopsy; all clusters in the biopsy have been considered and there are still biopsies left to consider.
-Check if the equalclones matrix is still valid. Did the clusters in the matrix get placed together? If not, remove them from the matrix so that it is correct.
-Begin creating the event matrix. Initially each row is a genetic alteration and each column a cluster in a specific sample.
-Make sure clusters placed together in equalclones get each other's events.
-All columns should have a "1" in the events belonging to the stem.
-Select an event in the event matrix. Make use of the matrix illustrating which events belong to each cluster.
-Select a cluster in a particular biopsy.
-Is the event present in this cluster? If so, add a "1" in the matrix to represent the presence of this event in this cluster.
-Have all clusters been considered?
-Have all events been considered?
-Use the allocations_samples matrix to stack the clusters on top of each other in the order determined previously.
-Select a biopsy.
-Select a daughter clone.
-Find the mother it has been determined to have.
-Add to a matrix in which the first column is the daughter, the second one the mother and the third the size of the mother.
-Have all clusters been considered in this sample? The other clusters should also have the events of this mother as well as each other's events.
-Give the daughter the mother's events in the event matrix.
-Multiple solutions? Is there any place where there is room?
-Add all of these matrices together by column.
-Have all subclones been handled?
-Transform the event matrix into phyDat format with levels 0 and 1.
-Output the final phylogenetic tree with pie charts.
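The inheritance step of the flowchart (each daughter receives its mother's events) can be illustrated with a minimal Python sketch. This is not the DEVOLUTION source code: the function name `build_event_matrix`, the dictionary-based inputs and the restriction to a single sample are assumptions made for illustration only.

```python
# Illustrative sketch only; names and data layout are hypothetical.
def build_event_matrix(cluster_events, mother_of):
    """Build a binary event matrix for the clusters of one sample.

    cluster_events: {cluster: set of events that define the cluster}
    mother_of:      {cluster: chosen mother cluster, or None for the stem}
    Returns (sorted event names, {cluster: list of 0/1 per event}).
    """
    def profile(cluster):
        # A subclone carries its own events plus those of all its ancestors.
        events = set(cluster_events[cluster])
        mother = mother_of.get(cluster)
        while mother is not None:
            events |= cluster_events[mother]
            mother = mother_of.get(mother)
        return events

    all_events = sorted(set().union(*cluster_events.values()))
    matrix = {c: [int(e in profile(c)) for e in all_events]
              for c in cluster_events}
    return all_events, matrix


cluster_events = {"Stem": {"A"}, "C1": {"B"}, "C2": {"C", "D"}}
mother_of = {"Stem": None, "C1": "Stem", "C2": "C1"}
events, matrix = build_event_matrix(cluster_events, mother_of)
```

In this toy example the subclone built from cluster C2 carries all four events, because it inherits A from the stem and B from C1. A matrix of this kind, with levels 0 and 1, is what would subsequently be converted to phyDat format.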

Input:
-The event matrix

Output:
-The event matrix in phyDat format, which can be used as input to construct the phylogenetic trees.

An Excel sheet is provided to the user that contains:
-The event matrix
-Which subclones are considered certain and uncertain.
-The input tumor data file, with each event annotated with the cluster to which it belongs.
-Overview: The MCF of each genetic alteration across samples.

SUPPLEMENTARY NOTE 1 -THE PROPOSED IDEA
Bulk genotyping provides information that can be used to compute the proportion of cells in each sample that have a particular aberration. To generate phylogenetic trees illustrating the relationship between the subclones present in the tumor, we must know which aberrations reside in the same cells and which subpopulations of cells the tumor consists of. A subclone is defined as a population of cells having a unique genetic profile.

... figure 1a. d) In a clinical situation some biopsies would be taken of the tumor. This is our window into the subclonal composition of the tumor. e) By searching for genetic aberrations that seem to follow each other over several biopsies, the subclones can be reconstructed. In this example alteration A is in all cells and B in almost all cells. Alterations C and D are only found in the area covered by biopsy B1. Alterations F, G and H seem to be in a subclone found in the area covered by biopsies B2, B3 and B5, and alteration E is found in area B4. The former sets of aberrations are most likely in the same cells, while the latter aberration is not. The next step is to elucidate the temporal order of the events to obtain the subclones in the tumor. From the table we can identify groups of cells having alterations A, B, ABE, ABCD and ABFGH, which is in good accordance with b).

1. Preprocessing of the data.
2. Clustering of genetic alterations based on information from multi-region sampling from the same patient.
3. Subclonal deconvolution based on information from multiple samples from the same patient.
4. Construction of an event matrix.
5. Making use of a mathematical model to reconstruct the phylogenetic trees, in this case
   a. Maximum likelihood
   b. Maximum parsimony

SUPPLEMENTARY NOTE 3 -PREPROCESSING OF THE DATA
The input data file should have the following format, exemplified here by WT11 (Wilms Tumor number 11) in Figure 2. Each row represents a genetic alteration in the biopsy specified by column 2.
The method used to detect the genetic alteration is given in column 9. Any method allowing the construction of a matrix such as the one in Supplementary Table 3 can be included in the analysis. Hence, for example, SNP-array, WES and WGS data can be combined in the same phylogeny.
Supplementary Table 3 A portion of the data obtained from SNP array analysis. The first column declares the tumor ID, in this case Wilms Tumor number 11. The second column gives the biopsy name, indicating which aberrations have been found in which biopsy. Columns 3-5 show the location of the alterations on the chromosomes. Columns 6 and 7 list the log2 median values for copy number aberrations and the variant allele frequencies (VAF) for point mutations, respectively. Columns 8 and 10 represent the type of aberration. A GAIN means that there is one extra copy of this particular gene segment. Similarly, a LOSS indicates that one copy of the gene segment has been lost. The rightmost column gives the fraction of the cells in that particular biopsy that harbor this aberration, denoted the mutated clone fraction (MCF).
The input file containing the genetic alterations across biopsies, as in Figure 2, should be imported into the script from an Excel sheet (xlsx) using the function load_matrix (Supplementary Figure 19). This makes sure it is in the correct configuration for downstream analyses.
The input file may contain information from multiple tumors, i.e. matrices from different tumors may be positioned after one another in the sheet by the user. The function splitdata (Supplementary Figure 20) allows extraction of data from a particular tumor, whose name is provided by the user. The user also obtains a matrix with the start and end positions (row numbers) of each tumor in the input file provided. The function does not presuppose that the rows are sequentially arranged; splitdata sorts them automatically, first by tumor, then by sample name within each tumor data set, and finally from lowest to highest chromosome number within each sample.
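The sorting and extraction performed by splitdata can be sketched as follows. This is a minimal Python illustration and not the DEVOLUTION implementation; the function name `split_data` and the dictionary keys `tumor`, `sample` and `chr` are hypothetical stand-ins for the corresponding columns of the segment file.

```python
# Illustrative sketch only; not the DEVOLUTION source code.
def split_data(rows, tumor_name):
    """Sort a segment table and extract the rows of one tumor.

    rows: list of dicts with the hypothetical keys "tumor", "sample", "chr".
    Returns (rows of the requested tumor, {tumor: (first_row, last_row)}),
    where row numbers are 1-based and refer to the sorted table.
    """
    # Sort by tumor, then sample name, then chromosome number.
    ordered = sorted(rows, key=lambda r: (r["tumor"], r["sample"], r["chr"]))
    spans = {}
    for i, r in enumerate(ordered, start=1):
        first, _ = spans.get(r["tumor"], (i, i))
        spans[r["tumor"]] = (first, i)
    selected = [r for r in ordered if r["tumor"] == tumor_name]
    return selected, spans


segments = [
    {"tumor": "WT11", "sample": "B2", "chr": 7},
    {"tumor": "RMS7", "sample": "B", "chr": 3},
    {"tumor": "WT11", "sample": "B1", "chr": 11},
    {"tumor": "WT11", "sample": "B1", "chr": 1},
]
wt11, spans = split_data(segments, "WT11")
```

Because the table is sorted before the row spans are computed, the input rows may arrive in any order, which mirrors the behaviour described above.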

Preprocessing
When calling the function, the user can choose a cutoff for the genetic alterations in the segment file to be considered separate events, reflecting the measurement uncertainty of the start and end positions of the genetic alterations. The user also chooses whether the phylogenetic tree should be rooted in a normal cell (containing no alterations, denoted "Normal") or in a cell encompassing the alterations shared between the subclones (denoted "Stem"). Sometimes a mixed segment file with SNP array data, WES data etc. is at hand. Therefore, the user can also choose which data to include in the analysis. In this way data from multiple different methods can be analyzed separately or in unison without having to separate the data manually.
Initially the algorithm checks whether an event cutoff has been provided; if not, a default of 1 Mbp is set (Supplementary Figure 21 1-2). Then it goes through the segment file to see if any columns are missing. If a column is missing, a warning message is printed in the console, the missing column is replaced with "NA" (not available) and the algorithm continues. If essential columns are missing the algorithm halts (Supplementary Figure 21 3-4).
Subsequently, missing values of MCF are identified. A missing value indicates that the MCF could not be determined for a technical reason. It is changed to 100 % if the event belongs to the stem, defined as the presence of the alteration in ≥ 90 % of the cells in all samples (default) or according to a cutoff provided by the user. The event is removed entirely if it is part of a subclone, in order not to overestimate genetic variation within the tumor (Supplementary Figure 21 5-6). The user can also declare which data to include in the further analysis, for example to only include genetic alterations identified with SNP-array or WES. The user can declare it as "all" if all events, regardless of method, should be included. If the "all" argument is not given, alterations detected with a method not declared by the user are excluded from the data file for further analysis. In this way the user can easily choose which combinations of methods to include, for example to produce one tree with SNP-array+WES, one with only SNP-array and one with only WES, without having to rearrange the matrix by hand (Supplementary Figure 21 7-8).
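The method filter and the handling of missing MCF values can be sketched in Python as below. The sketch is not taken from DEVOLUTION. The function name `preprocess`, the dictionary keys and, in particular, the rule used to decide whether an event with a missing MCF belongs to the stem (detected in every sample, with at least one known MCF and all known MCFs at or above the cutoff) are assumptions made for illustration.

```python
# Illustrative sketch only; the stem test for partly measured events is an assumption.
def preprocess(rows, methods="all", stem_cutoff=90.0):
    """rows: dicts with hypothetical keys "event", "sample", "method", "mcf".

    "mcf" is None when the MCF could not be determined.
    """
    if methods != "all":
        # Keep only alterations detected with one of the declared methods.
        rows = [r for r in rows if r["method"] in methods]
    samples = {r["sample"] for r in rows}
    by_event = {}
    for r in rows:
        by_event.setdefault(r["event"], []).append(r)

    cleaned = []
    for recs in by_event.values():
        if all(r["mcf"] is not None for r in recs):
            cleaned.extend(recs)          # nothing missing, keep as is
            continue
        known = [r["mcf"] for r in recs if r["mcf"] is not None]
        in_all = {r["sample"] for r in recs} == samples
        if in_all and known and all(m >= stem_cutoff for m in known):
            # Treated as a stem event: missing values are set to 100 %.
            cleaned.extend({**r, "mcf": 100.0 if r["mcf"] is None else r["mcf"]}
                           for r in recs)
        # Otherwise the event is subclonal with a missing MCF and is dropped.
    return cleaned


rows = [
    {"event": "1q GAIN", "sample": "B1", "method": "SNP", "mcf": 100.0},
    {"event": "1q GAIN", "sample": "B2", "method": "SNP", "mcf": None},
    {"event": "7p LOSS", "sample": "B1", "method": "SNP", "mcf": None},
    {"event": "TP53", "sample": "B2", "method": "WES", "mcf": 40.0},
]
snp_only = preprocess(rows, methods=("SNP",))
everything = preprocess(rows, methods="all")
```

Calling the function once per method combination gives the separate inputs for, say, a SNP-array tree and a combined tree without editing the table by hand.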

Localizing unique genetic alterations
A clustering algorithm was constructed to localize all unique genetic alterations throughout the tumor samples. The algorithm loops through the rows of the segment file, which lists the genetic aberrations across the biopsies. For each row it compares the type of genetic alteration and its position on the chromosome to the other rows, representing other genetic aberrations detected throughout the samples. If two events in different samples are on the same chromosome, constitute the same type of alteration, and have start and end positions that each differ by less than a certain cutoff, set by the user based on the measurement uncertainty of the data set, they are considered the same event detected in different samples. Otherwise they are considered different events in the evolution of the tumor. These conditions are combined in an and-statement. If two alterations are identified as different events, a version name (i.e. v1, v2, v3 etc.) is added to the alteration name of one of them, so that the algorithm can distinguish them further on (Supplementary Figure 21 9-14).
The default value of the cutoff ($co_{ev}$) for DEVOLUTION is 1 Mbp (mega base pairs). Since chromosome sizes range from 48 to 250 Mbp, this cutoff constitutes a start and end point deviation of 0.4-2 % of the chromosome length.
All of these conditions (same chromosome, same type of alteration, and start and end positions within the cutoff) have to be met in order for the algorithm to consider the two alterations analyzed to be identical. See Figure 3 for an example. The segment file was subsequently updated based on the clustering.
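A minimal Python sketch of this and-statement is given below. It is an illustration under stated assumptions and not the DEVOLUTION code: the function names, the dictionary keys, the comparison against the first occurrence of each event, and the naming scheme (first occurrence unversioned, later distinct events suffixed v2, v3, ...) are all hypothetical.

```python
# Illustrative sketch only; names and versioning scheme are hypothetical.
def same_event(a, b, cutoff_bp=1_000_000):
    """True if two alterations are judged to be the same event."""
    return (a["chr"] == b["chr"]
            and a["type"] == b["type"]
            and abs(a["start"] - b["start"]) < cutoff_bp
            and abs(a["end"] - b["end"]) < cutoff_bp)


def label_events(rows, cutoff_bp=1_000_000):
    """Give rows of the same event a shared name; version distinct ones."""
    representatives = []   # (name, first row seen for that event)
    counts = {}            # how many distinct events share a base name
    names = []
    for row in rows:
        for name, rep in representatives:
            if same_event(row, rep, cutoff_bp):
                names.append(name)
                break
        else:
            base = f'{row["chr"]} {row["type"]}'
            counts[base] = counts.get(base, 0) + 1
            name = base if counts[base] == 1 else f"{base} v{counts[base]}"
            representatives.append((name, row))
            names.append(name)
    return names


rows = [
    {"chr": "1", "type": "GAIN", "start": 10_000_000, "end": 50_000_000},
    {"chr": "1", "type": "GAIN", "start": 10_400_000, "end": 50_300_000},
    {"chr": "1", "type": "GAIN", "start": 10_000_000, "end": 52_000_000},
    {"chr": "1", "type": "LOSS", "start": 10_000_000, "end": 50_000_000},
]
names = label_events(rows)
```

The third row shares chromosome, type and start position with the first, but its end point deviates by 2 Mbp, so it is treated as a separate event and receives a version suffix.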

The overview matrix
An overview matrix was constructed, defined as a matrix that displays the MCF of each genetic alteration across biopsies (Supplementary Figure 21 15, Supplementary Figure 34).

Stem algorithm
The next step is to see if there is a declared stem and if all stem events in the data set have been declared correctly. In this study the default definition of a stem event is an event that is present in ≥ 90 % of the cells in all biopsies. The cutoff can be changed by the user. The algorithm loops through the overview matrix displaying the MCFs across biopsies to identify which genetic alterations fulfill the criteria for a stem event. The identified stem events are set to 100 % in all samples. They are subsequently removed from their original positions in the segment file and placed together at the beginning of the segment file. The overview matrix is updated with the changes, with the stem events placed at the beginning. This saves a lot of time for the user, since no manual stem declaration has to be made, and it allows flexibility since the stem cutoff can easily be changed. If for some reason no stem event is identified across the samples, the user is notified through the console and an artificial stem event is added at the head of the matrix (Supplementary Figure 21 16-22, Supplementary Figure 35).
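The stem rule can be sketched in a few lines of Python. This is an illustration and not the DEVOLUTION implementation; the function name `apply_stem_rule`, the nested-dictionary layout of the overview matrix and the label "Stem" for the artificial event are assumptions.

```python
# Illustrative sketch only; data layout and names are hypothetical.
def apply_stem_rule(overview, stem_cutoff=90.0):
    """overview: {event: {sample: MCF}}; a missing sample means MCF 0.

    Stem events (MCF >= cutoff in every sample) are set to 100 % and moved
    to the front. If none is found, an artificial stem event is added.
    """
    samples = sorted({s for mcfs in overview.values() for s in mcfs})
    stem, rest = {}, {}
    for event, mcfs in overview.items():
        if all(mcfs.get(s, 0.0) >= stem_cutoff for s in samples):
            stem[event] = {s: 100.0 for s in samples}
        else:
            rest[event] = dict(mcfs)
    if not stem:
        stem["Stem"] = {s: 100.0 for s in samples}   # artificial stem event
    return {**stem, **rest}    # dicts keep insertion order: stem first


overview = {
    "17q GAIN": {"B1": 60.0},
    "1p LOSS": {"B1": 95.0, "B2": 92.0},
}
updated = apply_stem_rule(overview)
no_stem = apply_stem_rule({"17q GAIN": {"B1": 50.0}})
```

Changing `stem_cutoff` is the only adjustment needed to test a stricter or more lenient stem definition.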

Clustering genetic alterations
A clustering algorithm was then applied to group events that seem to follow the same pattern over multiple samples. As discussed in the beginning of this document, alterations that follow each other in MCF over samples are most likely within the same subclone in the tumor and thus form the group of genetic alterations that distinguishes this group of cells from its ancestors. Note that these cells can also have other genetic aberrations, but those are not specific for this particular subgroup of cells compared to their ancestors. Simply identifying these clusters is not the complete solution, since the clusters must still be nested.
A clustering algorithm was used to assess the similarity between the alterations from the different tumor regions. Clustering in higher dimensions is difficult because Euclidean distances between data points become less informative: the points deplete the center and concentrate in the shell of the n-dimensional space. In our case the dimensions are the samples from the same patient. They are hence not completely independent of each other, and the number of independent dimensions (eigenvectors) is thus presumably lower than the total number of biopsies, which reduces the dimensionality and the problem.
Density-based clustering techniques such as DBSCAN are efficient at clustering non-uniform clusters and allow clustering without specifying the number of clusters beforehand, which is a prerequisite for many other established clustering algorithms. In addition, DBSCAN has only two hyperparameters: minPts, which is the minimal number of points that is allowed in a cluster, and ε, which is the radius within which points are included. The choice of ε can be aided by a k-distance graph, which illustrates the distance to the k-th nearest neighbor, where k = minPts-1. The value to choose is where this plot shows an elbow (Supplementary Figure 21 23 and 36). In brief, DBSCAN treats a point as a core point if at least minPts points lie within distance ε of it, joins core points that lie within ε of each other into the same cluster, attaches non-core points within ε of a core point to that cluster, and labels the remaining points as noise. The clustering method can be changed in the code, and it is easy to add your own.
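To make the two hyperparameters concrete, the following self-contained Python sketch implements a textbook version of DBSCAN and the sorted k-distance values used for the elbow plot. It is a generic illustration and not the clustering code used by DEVOLUTION; the example points, interpreted here as MCF values of six events in two biopsies, and the chosen values of ε and minPts are invented for the example.

```python
# Generic textbook DBSCAN, shown for illustration only.
import math


def k_distances(points, min_pts):
    """Sorted distance from each point to its k-th nearest neighbor, k = minPts-1."""
    k = min_pts - 1
    dists = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(d[k - 1])
    return sorted(dists)


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # The neighborhood includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:       # j is a core point: keep expanding
                queue.extend(nb_j)
    return labels


# Each point is the MCF of one event in two biopsies (invented values).
points = [(100, 100), (98, 100), (60, 0), (62, 2), (58, 1), (10, 90)]
labels = dbscan(points, eps=5, min_pts=2)
kd = k_distances(points, min_pts=2)
```

In this toy data set the sorted k-distances jump sharply at the last point, which is the kind of elbow that would guide the choice of ε.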
After the clustering, a matrix containing each cluster and its included genetic alterations is constructed. Let $C_{K \times N}$ be the matrix representing the clusters of genetic alterations. It has dimensions $K \times N$, where the row index $k$ runs over the genetic alterations in a cluster and the column index $n$ is the cluster number. All matrix entries $c_{kn} \neq 0$ are unique, i.e. the same genetic alteration cannot belong to multiple clusters (Supplementary Figure 21 24). The algorithm also constructs a matrix representing the clusters present in each biopsy and their sizes, determined by the mean MCF of the aberrations in the cluster (Supplementary Figure 21 25).
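The cluster-size matrix can be sketched as follows. The sketch is illustrative and not from the DEVOLUTION code; the function name `cluster_sizes`, the dictionary layout, and the choices to count an undetected event as MCF 0 and to omit clusters with mean 0 from a biopsy are assumptions.

```python
# Illustrative sketch only; names and conventions are hypothetical.
from statistics import mean


def cluster_sizes(overview, cluster_of):
    """overview: {event: {sample: MCF}}; cluster_of: {event: cluster name}.

    Returns {sample: {cluster: mean MCF}} for the clusters present in each sample.
    Using a dict for cluster_of guarantees that each event has exactly one cluster.
    """
    samples = sorted({s for mcfs in overview.values() for s in mcfs})
    members = {}
    for event, cluster in cluster_of.items():
        members.setdefault(cluster, []).append(event)

    sizes = {}
    for s in samples:
        sizes[s] = {}
        for cluster, events in members.items():
            size = mean(overview[e].get(s, 0.0) for e in events)
            if size > 0:
                sizes[s][cluster] = size
    return sizes


overview = {
    "a": {"S1": 60.0, "S2": 30.0},
    "b": {"S1": 62.0, "S2": 28.0},
    "c": {"S1": 10.0},
}
cluster_of = {"a": "B", "b": "B", "c": "C"}
sizes = cluster_sizes(overview, cluster_of)
```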

Finding the possible allocations for each cluster
We now have the clusters of genetic alterations across biopsies. These clusters are, however, not the actual subclones, merely events that aid the identification of the subclones. The actual subclones will consist of a linear combination of these clusters of genetic alterations. To identify the subclones the clusters are nested, taking their prevalence in each biopsy into consideration and combining information across biopsies in the process.
First, we need to define a space matrix describing the subclonal partitioning of each biopsy. The space available in a single biopsy is 100 %, and the space of all biopsies can thus be represented by a matrix $S_{P \times B} = (s_{pb})$, where $p$ indexes the partitions of the available space in a biopsy and $s_{pb}$ is the space available in partition $p$ of biopsy $b$. Initially $s_{1,b} = 100 \wedge s_{p \neq 1,b} = 0$, meaning that we start with the stem events at the bottom. Each non-stem cluster is then allocated to a space where there is still room for it. The clusters of aberrations in each biopsy, as supplied by $Z_{C \times B}$, are allocated to the space in decreasing order of MCF, altering the magnitude of the spaces in $S_{P \times B}$ based on the MCF of the clusters allocated to them.
A biopsy is selected. The clusters identified in this biopsy are ordered from highest to lowest MCF. In this way, there will always be at least one way to nest the clusters. The largest cluster not yet allocated is chosen. The space available in the biopsy is considered. The algorithm identifies in which of the already allocated clusters this cluster can be nested. The algorithm also considers whether previously allocated clusters could be placed in another way to reveal additional possible allocation patterns.
The possible allocations for the cluster are catalogued in a matrix. A separate matrix, named equalclones, is introduced to keep track of cases where clusters of the same size (MCF) are nested in each other. The reason for this is that we do not yet know the order of the genetic alterations when the clusters are of the same size. To determine it, additional information from other samples is needed, which might reveal that one of the clusters has a smaller MCF than the other, providing information about the temporal order of these genetic alterations. When the information about the allocations has been saved, the largest cluster not yet allocated is extracted and the same procedure is repeated. This continues for all clusters in the biopsy and is then repeated for all biopsies. In the end a matrix displaying the possible allocations for each cluster in each biopsy is obtained (Supplementary Figure 21-22 26-36). The next step is thus to find a unified solution that is feasible across samples. Supplementary Figure 37 shows an example of this allocation process together with the space matrix.

Supplementary Figure 37 An example of the allocation of clusters. In this sample (S1) there are 4 identified clusters of genetic alterations: clusters A, B, C and D. The genetic alterations in these clusters have an MCF of 100 %, 60 %, 30 % and 10 % in S1, respectively. As the clusters are allocated to the space, the space matrix visualized below the ovals changes. Initially the space is occupied merely by the stem events. In the first step there is only one possible allocation of Cluster A S1 (denoting cluster A in sample 1). This cluster then takes over the entire space of the matrix, meaning that additional genetic alterations in the biopsy must be equal or later in the temporal evolution of the tumor. In the next step cluster B S1 is nested in Cluster A S1, since this is the only feasible position for nesting. Cluster C S1 can be nested either in Cluster A S1, resulting in subclones of cells that have the genetic profile of clusters A+C, or in cluster B, giving cells having clusters A+B+C. Cluster D S1 can be nested at all levels. If the clusters are nested consecutively, as in the top row of ovals, the space changes as visualized by the matrix below them. The possible allocations for each cluster are saved in possible_mothers.
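The example with clusters A, B, C and D (100 %, 60 %, 30 % and 10 %) can be reproduced with a short Python sketch. It is a simplification written for illustration and not the DEVOLUTION algorithm: candidate mothers are taken to be all clusters that come earlier in the size ordering, the re-arrangement of previously allocated clusters is ignored, and the function names are hypothetical.

```python
# Illustrative, simplified sketch; not the DEVOLUTION allocation algorithm.
def candidate_mothers(sizes):
    """sizes: {cluster: MCF in one biopsy}.

    Simplification: every cluster earlier in the decreasing-MCF ordering
    is listed as a candidate mother; the largest cluster sits on the stem.
    """
    order = sorted(sizes, key=sizes.get, reverse=True)
    return {c: (order[:i] or ["stem"]) for i, c in enumerate(order)}


def remaining_space(sizes, mother_of):
    """Allocate clusters in decreasing order and track the space left at each level."""
    space = {"stem": 100.0}
    for c in sorted(sizes, key=sizes.get, reverse=True):
        m = mother_of[c]
        if space[m] < sizes[c]:
            raise ValueError(f"no room for {c} in {m}")
        space[m] -= sizes[c]       # the daughter occupies part of the mother
        space[c] = sizes[c]        # and offers its own space to later clusters
    return space


sizes = {"A": 100.0, "B": 60.0, "C": 30.0, "D": 10.0}
mothers = candidate_mothers(sizes)
consecutive = remaining_space(sizes, {"A": "stem", "B": "A", "C": "B", "D": "C"})
branched = remaining_space(sizes, {"A": "stem", "B": "A", "C": "A", "D": "A"})
```

Both nestings fit within the available space in this single sample, which is why information from further biopsies is needed to choose between them.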

Looking for discrepancies between samples
As was discussed in the previous section, there might be multiple solutions. If there are multiple biopsies available, information can be combined from each biopsy to find a unified solution not in disagreement with any of the samples.
Before doing this, the equalclones matrix is revisited to make sure that the equalclones can actually be placed together, by analyzing their pattern across samples. They must not each occur without the other in separate samples. They must not cross each other in MCF across biopsies. There may also be cases in which it is not even possible to nest them together in some samples even though both are present (Supplementary Figure 22 37, Supplementary Figure 38).
Supplementary Figure 38 Illustration of how information across samples (here S1, S2 and S3) can be unified to conclude that two clusters cannot be nested. Three different scenarios are displayed to exemplify this. In the first sample, S1, two clusters are nested together and consequently also added to the equalclones matrix. In the top row the clusters are found without one another in separate samples. Hence there exist cells having the alterations of these clusters independently, making it unlikely that these alterations are in the same cells. In the middle row the MCF values cross one another across samples. In the bottom row nesting is not possible in all samples, even though both clusters are present. In S2 the blue cluster must be nested in the green cluster which in turn is nested in the red cluster, while the blue cluster is nested in the red in S1 and S2. Hence the temporal evolution of the genetic alterations across samples is contradictory for this solution, and the most probable solution is that these clusters are present in separate cells and should not be nested in S1.
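The first two scenarios (clusters found without one another, and MCF values that cross) can be checked with a small Python sketch. It is illustrative and not the DEVOLUTION code; the function name `can_stay_together` is hypothetical, an absent cluster is treated as MCF 0, and the third scenario, which involves a third cluster, is deliberately not covered.

```python
# Illustrative sketch only; covers the first two scenarios, not the third.
def can_stay_together(x, y):
    """x, y: {sample: MCF} for two clusters of equal size in some sample.

    Returns False if x is larger than y in one sample and y is larger than x
    in another. With absence counted as MCF 0, this also catches the case in
    which each cluster is found without the other in separate samples.
    """
    samples = set(x) | set(y)
    x_bigger = any(x.get(s, 0) > y.get(s, 0) for s in samples)
    y_bigger = any(y.get(s, 0) > x.get(s, 0) for s in samples)
    return not (x_bigger and y_bigger)


consistent = can_stay_together({"S1": 50, "S2": 80}, {"S1": 50, "S2": 60})
apart = can_stay_together({"S1": 50, "S2": 30}, {"S1": 50, "S3": 20})
crossing = can_stay_together({"S1": 50, "S2": 70, "S3": 20},
                             {"S1": 50, "S2": 40, "S3": 60})
```

Pairs that fail the check would be removed from equalclones, so that they are no longer assumed to reside in the same cells.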
Subsequently the equalclones should get each other's mothers, as well as each other as mother, in possible_mothers in the samples in which they are of equal size (Supplementary Figure 22 38, Supplementary Figure 39).
Supplementary Figure 39 Clusters of the same size that are nested in a particular sample should get each other's mothers in that sample. In sample S1 cluster C S1 and cluster D S1 are nested in each other. Based solely on this sample it cannot be determined whether a cell has acquired the alterations of cluster A, followed by C and then D, or whether the cells have obtained A first, followed by D and then C. Hence, the clusters get each other's mothers when they are nested in a sample and are of the same size. By including information from the other samples, it can be determined that cluster D should be nested in C, which in turn is nested in A. This stresses the need to update the possible_mothers matrix in subsequent steps to obtain the most probable solution for the temporal order of the clusters.
The user is also allowed to provide rules for the nesting of clusters of genetic alterations. Some genetic alterations may be known through studies, for example, to never co-occur. In those cases, the user can provide the algorithm with a rule, forbidding these events to be placed in the same cell. This rule is taken as input by the DEVOLUTION algorithm in the shape of a matrix with two columns in which events in the same row are the ones that should not be nested in each other (Supplementary Figure  22 39-40).
The next step is to look for a unifying solution between the samples of the tumor. We want to make an allocation of the clusters that does not contradict any of the biopsies. We fuse the information from multiple biopsies to extract the trajectory of genetic alterations that is most probable. If the data set only contain information from a single biopsy, no such comparison is needed or possible, but the algorithm can still do the nesting and produce a phylogeny for the user. There is always a unique solution. If there are multiple biopsies each cluster is initially considered separately.
First, a cluster not yet processed is chosen. The possible nestings of this cluster across the biopsies in which it exists are extracted. A table is constructed that shows in how many of the samples each nesting is possible. This identifies which nestings are possible in all, or almost all, samples and which nestings are only possible in a subset of biopsies, the latter being less probable nesting patterns. The clusters that can be selected as the mother cluster in all biopsies are identified. Ideally, there would be only one single allocation that is feasible in all samples. If there is only one, that is chosen as the mother cluster. Sometimes there may be multiple equally probable solutions. In that case, we first assess whether the user has provided the algorithm with a rule forbidding certain events from being placed together in the same cells. If so, those allocations are removed. As a next step, the MCF of each cluster across samples is extracted, and how the MCFs change across the samples is assessed.
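The tabulation step can be sketched as follows. This is an illustrative Python example under the assumption that the possible mothers of one daughter cluster are available as a mapping from sample name to a list of candidate mothers; the function name and data layout are hypothetical and do not come from the package.

```python
from collections import Counter


def tally_possible_mothers(possible_by_sample):
    """Count in how many samples each candidate mother is possible.

    possible_by_sample: dict mapping sample name -> candidate mothers
    of one daughter cluster in that sample (assumed layout).
    Returns the counts and the candidates possible in every sample.
    """
    counts = Counter()
    for mothers in possible_by_sample.values():
        counts.update(set(mothers))  # count each candidate once per sample
    n_samples = len(possible_by_sample)
    in_all = sorted(m for m, c in counts.items() if c == n_samples)
    return counts, in_all


possible = {"S1": ["A", "C"], "S2": ["A"], "S3": ["A", "C", "stem"]}
counts, in_all = tally_possible_mothers(possible)
# Here only cluster A is a possible mother in all three samples.
```

If `in_all` holds exactly one cluster, it would be taken as the mother; otherwise the tie-breaking steps described in the text apply.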
Mothers whose MCF rises and falls together with that of the daughter cluster across samples are selected as potential mother candidates. If there are multiple such candidates, the one providing the largest space is chosen, unless there is a rule provided by the user, in which case that rule overrides the solution. If there is one single cluster that follows the daughter cluster in size across biopsies, that cluster is chosen. If no cluster follows the daughter cluster in size and no rules aid us in the choice of allocation, the largest space is chosen. At the end of this part of the subalgorithm, we check whether any of the mothers are no longer possible due to previous allocations of clusters. There may, for example, be some other cluster that has to be nested in this mother cluster, which decreases the space available and may prevent the cluster now being handled from being allocated to that position. In that case, the mother is removed from possible_mothers for the daughter cluster. The other possible mothers are moved accordingly in the matrix, where the first row represents the preferred mother cluster and the following rows the less preferred ones, given the rules above. Hence, if the preferred solution is removed through this algorithm, the second most preferred allocation is chosen instead (Supplementary Figure 40). There might be a situation in which no allocation of the cluster is allowed in all samples. We first assess whether an allocation is the only solution in any sample or is prescribed by a rule provided by the user, in which case that allocation is used. If there are no rules, we look for allocations possible in all but one sample. When there is no indication of which allocation is correct, the one providing the largest space is chosen.
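A minimal Python sketch of the size-following criterion is given below. It rests on two assumptions that are not stated in the text: a mother is taken to follow the daughter if its MCF changes in the same direction between every pair of consecutive samples, and the space offered by each candidate is summarized as a single number. All names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def follows(daughter_mcf, mother_mcf):
    """True if both MCF series change in the same direction between
    every pair of consecutive samples (assumed definition)."""
    d = [sign(b - a) for a, b in zip(daughter_mcf, daughter_mcf[1:])]
    m = [sign(b - a) for a, b in zip(mother_mcf, mother_mcf[1:])]
    return d == m


def choose_mother(daughter_mcf, candidates, free_space):
    """candidates: dict mother -> list of MCFs across samples.
    free_space: dict mother -> available space (assumed summary value).
    Prefer mothers that follow the daughter; break ties, or fall back,
    by taking the candidate with the largest free space."""
    following = [c for c in candidates if follows(daughter_mcf, candidates[c])]
    pool = following if following else list(candidates)
    return max(pool, key=lambda c: free_space[c])


candidates = {"A": [90, 95, 80], "B": [60, 50, 70]}
chosen = choose_mother([20, 40, 10], candidates, {"A": 30, "B": 50})
# A rises and falls with the daughter, so it is chosen although B has more space.
```

User-provided rules, which according to the text override this choice, are deliberately left out of the sketch.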
The procedure is repeated for all clusters identified across the biopsies. We now have a determined allocation for every cluster across the biopsies. To minimize the risk of contradictions caused by the allocations of previous clusters, the subalgorithm is rerun once to make sure all mothers are still possible when they are allocated. This takes minimal extra time (< 0.2 s) but provides a more robust output (Supplementary Figure 22 41-71).

Algorithm for making sure clusters are allocated in the correct order in all samples
The aim of this algorithm is to nest the clusters across the biopsies in the order determined by the algorithm described in the previous section. In the process, the final partitioning of the tumor space, i.e. the spaces left at each level of the nesting, corresponding to the proportion of cells in the biopsy belonging to a certain subclone, is computed as well.
A biopsy is selected, followed by extraction of all clusters identified within that biopsy and their corresponding MCFs. These clusters are then ordered from largest to smallest. By doing this, we make sure there will always be at least one solution for the nesting of each biopsy. The largest cluster not yet considered is chosen. Initially, we start with a space in which 100 % of the cells have the genetic alterations, i.e. they have the stem event. Hence, there will only be one single space, of size 100, for the new cluster to be placed in. In general, the largest space available is always identified at the beginning of the algorithm.
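The ordering and space bookkeeping for one biopsy can be illustrated with the following Python sketch. It is a simplification under stated assumptions: clusters of equal size and the handling of not-yet-allocated mothers are ignored, the preferred mothers are passed in as a plain dictionary, and all names are hypothetical.

```python
def nest_biopsy(mcf, preferred_mother=None):
    """Nest the clusters of one biopsy from largest to smallest.

    mcf: dict cluster -> MCF (percent) in this biopsy.
    preferred_mother: dict cluster -> mother chosen in earlier steps
    (assumed input). Falls back to the largest free space when the
    preferred mother is missing or has too little room.
    Returns the chosen mother per cluster and the remaining spaces.
    """
    preferred_mother = preferred_mother or {}
    space = {"stem": 100.0}  # initially all cells carry only the stem
    mother_of = {}
    for cluster in sorted(mcf, key=mcf.get, reverse=True):
        size = mcf[cluster]
        mother = preferred_mother.get(cluster)
        if mother not in space or space[mother] < size:
            mother = max(space, key=space.get)  # largest space available
        mother_of[cluster] = mother
        space[mother] -= size   # the daughter occupies part of the mother
        space[cluster] = size   # and opens a new space of its own size
    return mother_of, space


mother_of, space = nest_biopsy({"A": 80, "B": 30, "C": 50})
```

Because clusters are processed in decreasing order, the most recently placed cluster always has room for the next one, which mirrors the statement that at least one solution exists for each biopsy. The remaining spaces sum to 100 at every step.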
If there is a rule for the allocation of this cluster, either because it was provided by the user, because it is the single solution in the sample, or because a particular allocation has been specifically chosen in the algorithm above, we extract the mother cluster it must be nested in. We double-check that this mother exists in the sample and that there is enough room for the daughter cluster to be nested in it. If this is not the case, the second most preferred mother in possible_mothers is chosen. If that one is not present either, the largest space is chosen. We also save information in a matrix for the cases in which the cluster could not be allocated correctly. Another case may be that the mother exists in the sample but has not been allocated yet. This mother cluster (e.g. M1) may also have a mother cluster (e.g. M2) that has not been allocated yet, in which case the clusters are allocated such that M2 is allocated to the space first, then M1 is nested within it, and then the daughter cluster is nested in M1. If M2 has been allocated, we nest M1 in M2 and then the daughter cluster in M1. If there is no rule for where the mother should be placed, it is allocated to the largest space. If there is no rule for the allocation of a certain cluster, the allocation determined by the algorithm in which we considered discrepancies between samples is used.
In the next step, we consider whether the mother or daughter cluster is in equalclones. For example, if cluster A consists of 5 genetic alterations and has an MCF of 80 %, and another cluster B has 3 genetic alterations that also have an MCF of 80 % in this particular sample, these clusters will be nested. This means that 80 % of the cells in this biopsy will have the genetic alterations of both cluster A and cluster B. If another cluster C has an MCF of 40 %, the cells making up these 40 % of the biopsy will have the genetic alterations of clusters A, B and C. In another sample, the MCFs might differ from one another such that we know that cluster A > B > C. Hence, C will be nested in B and B nested in A. This order of the nesting must be taken into account when constructing the event matrix.
There are four different situations that can occur:
• The mother is in equalclones but not the daughter: The daughter's space is removed from all clusters in the equalclones group, including the mother cluster. If the daughter and mother are equal in size, the daughter should be included in equalclones together with the mother.
• The mother and the daughter are both in equalclones: If they are of the same size, no spaces are removed. The clusters are simply nested together. We double-check that they are placed in the same column of equalclones for this sample; otherwise, the daughter is moved so that they are. If the mother and daughter clusters are not of the same size, the size of the daughter is subtracted from the mother's space and from the space of the equalclones placed together with it.
• The mother is not in equalclones but the daughter is: If they are of the same size, no spaces are removed. The clusters are simply nested together. We must make sure the mother cluster is added to equalclones together with the daughter.
• Neither of them is in equalclones: If they are of the same size, they should both be added to equalclones. No spaces are removed. If they are not of the same size, the daughter's space is removed from the mother's.
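The four situations reduce to two operations, merging equal-sized clusters into one equalclones group or subtracting the daughter's size from the mother and its group. The Python sketch below illustrates this under the assumption that equalclones for one sample can be represented as a list of sets; this representation and the function name are hypothetical and differ from the matrix used by the algorithm.

```python
def nest_with_equalclones(mother, daughter, mcf, space, groups):
    """Update the free spaces and the equal-size groups of one sample.

    mcf: dict cluster -> MCF in this sample.
    space: dict cluster -> remaining free space (modified in place).
    groups: list of sets, each holding clusters of equal size that are
    nested together (assumed stand-in for one equalclones column).
    """
    def group_of(cluster):
        for g in groups:
            if cluster in g:
                return g
        return None

    if mcf[mother] == mcf[daughter]:
        # Same size: no space is removed, both end up in the same group.
        gm, gd = group_of(mother), group_of(daughter)
        merged = (gm or {mother}) | (gd or {daughter})
        groups[:] = [g for g in groups if g is not gm and g is not gd]
        groups.append(merged)
    else:
        # Different size: subtract the daughter from the mother and from
        # every cluster grouped with the mother.
        for cluster in (group_of(mother) or {mother}):
            space[cluster] -= mcf[daughter]


mcf = {"A": 80, "B": 80, "C": 40}
space = dict(mcf)
groups = []
nest_with_equalclones("A", "B", mcf, space, groups)  # A and B are equal
nest_with_equalclones("B", "C", mcf, space, groups)  # C is smaller than B
```

With the values from the example in the text (A and B at 80 %, C at 40 %), A and B end up in one group and both retain 40 % free space after C is nested.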
The mother cluster chosen for each daughter cluster is saved in allocation_samples. The process is repeated until all clusters have been processed in all biopsies. In the end we will have a complete allocation_samples matrix that shows the nesting of all clusters across biopsies (Supplementary Figure 22-24 72-106).

Creating the event matrix
The nesting of all clusters of genetic alterations across the biopsies is now known. The next step is to construct the event matrix based on this nesting. See Figure 9 for an example.
An algorithm was constructed that checks the validity of the final equalclones matrix. There may be clusters in this matrix that in the end were nested at a different position, in which case they should be removed from the matrix (Supplementary Figure 24 107).
In this first version of the event matrix, each row represents a genetic alteration and each column an identified subclone in a given sample. At this stage, the same subclone can appear multiple times, but in different samples. This gives us information about its size across samples and about the samples in which it is and is not found.
An event in the event matrix is chosen, followed by a biopsy and then a cluster within that biopsy. If the event is present in this cluster, a "1" is added in the event matrix at the column representing this cluster in this biopsy. This procedure is repeated for all genetic alterations. In the end, the matrix will show the events encompassed in each cluster. Then a "1" is added in all columns for the events incorporated in the stem, which should be present in all cells (Supplementary Figure 24 108-116). The next step is to make sure the determined nesting pattern is correctly translated into an event matrix.
A biopsy is selected, followed by a daughter cluster within that sample. The mother cluster it should be nested into is identified using the allocation_samples matrix, in which the nesting pattern is displayed. A matrix is constructed in which the first column is the daughter cluster, the second the mother cluster and the third the MCF of the mother cluster. The matrix is then ordered from largest to smallest. The largest one not yet considered is chosen. If the daughter is part of equalclones, the other clusters within this group should also have the events of this mother. If it is not part of equalclones, the daughter is simply given the events of the mother in the event matrix. This is repeated until all clusters in all samples have been handled. Then the output event matrix is obtained (Supplementary Figure 24 117-131, Supplementary Figure 41).
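The translation of a nesting pattern into an event matrix can be sketched in Python as follows. The sketch assumes that each column (a cluster in a biopsy) is described by the set of events that define the cluster itself, that the nesting is given as a mapping from each column to its mother, and that this mapping contains no cycles. Equalclones handling is omitted and all names are hypothetical.

```python
def build_event_matrix(cluster_events, mother_of, stem_events):
    """Build a binary event matrix (rows = events, columns = clusters).

    cluster_events: dict column -> set of events defining that cluster.
    mother_of: dict column -> mother column, or a label such as "stem"
    that is not itself a column (assumed layout).
    stem_events: events present in all cells.
    """
    def lineage_events(column):
        events = set(stem_events) | set(cluster_events[column])
        mother = mother_of.get(column)
        while mother is not None and mother in cluster_events:
            events |= set(cluster_events[mother])  # inherit from the mother
            mother = mother_of.get(mother)
        return events

    all_events = sorted(set(stem_events).union(*cluster_events.values()))
    columns = list(cluster_events)
    matrix = {
        event: [1 if event in lineage_events(c) else 0 for c in columns]
        for event in all_events
    }
    return columns, matrix


columns, matrix = build_event_matrix(
    cluster_events={"S1_A": {"e1"}, "S1_B": {"e2"}},
    mother_of={"S1_A": "stem", "S1_B": "S1_A"},
    stem_events={"e0"},
)
```

In this toy input, the column for S1_B receives the stem event, its own event and the event inherited from its mother S1_A.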

Function: Subclones
This function identifies all subclones across samples, defined as a cell population having a unique genetic profile. The algorithm also identifies additional possible solutions for the phylogenetic tree structure, which the user can assess to produce alternative phylogenetic trees that still explain the data set at hand.
First, all unique subclones across the samples are identified. This is done by finding all unique combinations of the genetic alterations in the event matrix and giving them new unique subclone names, creating a new event matrix in which each column is unique. Also, a column representing the stem is added. This is either a normal cell or a cell containing all alterations fulfilling the criterion for being a stem event (Supplementary Figure 25 1-6, Figure 41 bottommost panel).
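Finding the unique genetic profiles amounts to collapsing identical columns of the event matrix. A minimal Python sketch is shown below; the dictionary-based matrix layout and the generated names of the form `Subclone_1` are assumptions for illustration, and the additional stem column is not modeled.

```python
def unique_subclones(columns, matrix):
    """Collapse identical columns of a binary event matrix.

    columns: list of column names (cluster in a given sample).
    matrix: dict event -> list of 0/1 values, one per column.
    Returns a dict profile -> subclone name and a dict column -> name.
    """
    events = sorted(matrix)
    profiles = {}
    membership = {}
    for j, column in enumerate(columns):
        profile = tuple(matrix[e][j] for e in events)
        if profile not in profiles:
            # Hypothetical naming scheme for newly found profiles.
            profiles[profile] = f"Subclone_{len(profiles) + 1}"
        membership[column] = profiles[profile]
    return profiles, membership


profiles, membership = unique_subclones(
    ["S1_A", "S2_A", "S1_B"],
    {"e0": [1, 1, 1], "e1": [1, 1, 1], "e2": [0, 0, 1]},
)
```

Columns S1_A and S2_A share the same profile and therefore map to the same subclone, while S1_B forms a second one.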
By making use of the phylogeny and possible_mothers, a matrix is constructed that shows which clusters are placed at each level in the phylogeny across biopsies, as well as the proportion of cells having these alterations. Hence, for each level corresponding to a subclone name in the phylogeny, the clusters nested on top of it and the sizes of these clusters are known (Supplementary Figure 25 3). Clusters violating rules in other samples are removed. At this step, the user is given a matrix stating which subclones in the phylogeny are certain (there is only one possible allocation of the clusters resulting in this subclone) or uncertain (the clusters contained in this subclone could be allocated to other places such that this subclone does not exist) (Supplementary Figure 25 7-9).
If at least one subclone is uncertain, there are multiple solutions for the phylogeny. In that case, the user is asked whether an alternative solution should be shown or not. If the user does not want that, the suggested solution is shown. If the user opts for an alternative solution, a matrix is first created that illustrates which clusters across biopsies are nested in which. In this matrix there cannot be any duplicates. For example, cluster A in biopsy 1 (denoted B1_cluster_A) in Figure 7 can only be nested in one single position. Based on this information, the size of the space at each level of the tree, with and without the nested clusters, can be computed (Supplementary Figure 25 10-14, Supplementary Figure 41).
The user can choose a cutoff for which subclones to reshuffle. The matrix showing certain and uncertain subclones can be used as an aid for selecting this cutoff. The default is 30 %. In that case, all clusters < 30 % in all samples will be randomly reshuffled in the tree. To do this, all clusters fulfilling this requirement are removed from the phylogeny, leaving a tree only containing the subclones for which there is only one nesting pattern of the encompassed clusters. The disconnected clusters are arranged from largest to smallest. The largest cluster not yet allocated is chosen. The algorithm finds at which levels of the tree, i.e. after which subclones, this cluster could be allocated without breaking any of the previously established rules for the allocation of this cluster. One of the allocations fulfilling this is randomly chosen. This alters the spaces remaining in the tree. Then the next disconnected cluster not yet considered is chosen and the process is repeated. When all clusters have been reestablished in the phylogeny, an alternative solution is at hand. The user will get a new event matrix as output (Supplementary Figure 25 15-24), from which a phylogeny can be reconstructed.
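The reshuffling procedure can be sketched in Python as below. The sketch makes several assumptions: the tree remaining after removal of the small clusters is represented only by the free space of each node per sample, a placement is considered feasible when the parent has enough free space in every sample, clusters are ordered by their largest MCF across samples, and user-provided rules are not checked. Names and data layout are hypothetical.

```python
import random


def reshuffle_small_clusters(mcf, space, cutoff=30.0, rng=None):
    """Randomly re-place clusters whose MCF is below the cutoff in all samples.

    mcf: dict cluster -> dict sample -> MCF of the removed clusters.
    space: dict node -> dict sample -> free space in the remaining tree.
    Returns the chosen parent per cluster and the updated free spaces.
    """
    rng = rng or random.Random()
    space = {node: dict(per_sample) for node, per_sample in space.items()}
    loose = [c for c in mcf if c not in space and max(mcf[c].values()) < cutoff]
    loose.sort(key=lambda c: max(mcf[c].values()), reverse=True)
    placement = {}
    for cluster in loose:
        feasible = [
            parent for parent in space
            if all(space[parent].get(s, 0.0) >= size
                   for s, size in mcf[cluster].items())
        ]
        if not feasible:
            raise ValueError(f"no feasible position for {cluster}")
        parent = rng.choice(feasible)  # one feasible position at random
        placement[cluster] = parent
        for s, size in mcf[cluster].items():
            space[parent][s] = space[parent].get(s, 0.0) - size
        space[cluster] = dict(mcf[cluster])  # the cluster opens its own space
    return placement, space


small = {"X": {"S1": 20, "S2": 10}, "Y": {"S1": 5, "S2": 25}}
tree_space = {"stem": {"S1": 40, "S2": 30}, "A": {"S1": 60, "S2": 70}}
placement, new_space = reshuffle_small_clusters(small, tree_space, rng=random.Random(0))
```

Each call may return a different placement, which corresponds to one alternative solution. The total space per sample is conserved by the update, so the free spaces still sum to 100.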
The total number of alternative solutions can be very large for cases in which a vast number of clusters with small MCFs are identified across samples. The phylogenetic tree structure may still not change much. There may be a couple of clusters with large MCFs that can only be nested in a single way and then many clusters with very small MCFs that can be nested in many combinations, both within the large clusters and within each other, resulting in a large number of phylogenies in which the main structure of the tree remains stable. Hence, the number of possible phylogenies provides very little information about the reliability of the phylogeny at hand. The user can, through the algorithm, choose whether or not to color the tip labels of the subclones depending on whether they were created using clusters for which a single solution was possible or clusters for which multiple solutions were possible, providing a visual cue of stable and dubious parts of the phylogeny.

SUPPLEMENTARY NOTE 5 -FUNCTION: MAKE_PIE
This function loops through all subclones and extracts their sizes across samples. It then produces a matrix in which the first column is the proportion of cells in this sample that have the alterations belonging to this subclone and the second column is 100 minus this size, i.e. the remainder of the pie. Together, these values across biopsies are used to produce the pie charts belonging to each subclone (Supplementary Figure 26 1-10).
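The two values per pie chart can be illustrated with a short Python sketch. It assumes that the second value is meant as the remainder of the pie, so that the two slices sum to 100; the function name and input layout are hypothetical.

```python
def pie_values(subclone_sizes):
    """Return, per sample, the subclone's share and the remaining share.

    subclone_sizes: dict sample -> percentage of cells in that sample
    carrying the alterations of one subclone.
    """
    return {
        sample: (size, 100.0 - size)  # (subclone slice, remainder slice)
        for sample, size in subclone_sizes.items()
    }


slices = pie_values({"S1": 80.0, "S2": 25.0})
```

Each pair can be passed to any plotting routine that draws a two-slice pie chart.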
The user can choose between three different color modes for the pie charts.
• Nocol: The pie charts have the default color red. The corresponding sample name is shown above each pie chart.
• Col: For each sample, a color is randomly picked to represent it. The pie charts will hence be colored and there will not be a sample name above the pie charts. A legend is produced that indicates which color corresponds to which sample.
• Custom: The user can choose which color each sample should have.
The output is one figure for each subclone, in which the pie charts are arranged beside one another (Supplementary Figure 26 11-17).

SUPPLEMENTARY NOTE 6 -FUNCTION: PIE_IT
This function adds the pie charts to the phylogenetic trees. As input, the user has to provide the phylogenetic tree, the pie charts, a number for the offset of the pie charts relative to the subclone names and a number for the size of the pie charts. The offset and size may have to be altered based on the size of the phylogenetic tree.
The algorithm loops through the subclones in the tree. For each subclone, it finds its position and its corresponding pie chart and adds the pie chart at that position (Supplementary Figure 27 1-7).

SUPPLEMENTARY NOTE 7 -FUNCTION: PHYDATEVENT
This function transforms the event matrix into phyDat format so that it is compatible with the phangorn package used to produce the trees (Supplementary Figure 28 1-3).

SUPPLEMENTARY NOTE 8 -EXCEL SHEET OUTPUT
The algorithm provides an Excel sheet that includes the event matrix, a list of which subclones in the phylogenetic trees are considered certain and uncertain, the segment file annotated with the cluster to which each event belongs, and an overview matrix containing the MCF of each genetic alteration across samples (Supplementary Figure 29 1).