Introduction

Cancer is heterogeneous in the sense that the cancer cells in a tumor are not genetically identical, but form distinct clones, defined as subpopulations of cancer cells that host the same genomic aberrations. In aggressive and metastatic cancers, these genomic aberrations quickly evolve, resulting in extreme spatial and temporal heterogeneity1,2. Therefore, multiple biopsies over different locations and at different time points need to be collected and sequenced in order to capture the complexity of tumor genomic landscapes and provide insight into how tumors evolve and escape treatment3,4. Accordingly, computational tools are needed to accurately characterize the clonal structure of cancer and reveal how that structure evolves over time.

In recent years, a large number of computational tools and statistical models have been developed to analyze tumor heterogeneity from DNA sequencing data (Table 1). However, most of these tools only model one type of genomic aberration, such as single-nucleotide variants (SNVs), copy number alterations (CNAs), or structural variants. Restricting the analysis to a single type of mutation not only reduces statistical power to accurately detect the clonal structure within the tumor, but also prevents us from understanding interactions among different types of mutations. Furthermore, many SNV-based methods assume that no copy number changes have occurred, which is extremely improbable. Therefore, their estimation of the prevalence of a given clone can be inaccurate, and the corresponding heterogeneity results may be misleading. Existing methods that capture SNVs and CNAs in the same model (i.e., phyloWGS5, SPRUCE6 and Canopy7) require running a CNA-calling algorithm before heterogeneity analysis, but accurate CNA characterization also depends on heterogeneity analysis.

Table 1 Software tools for characterizing within-patient and intra-tumor heterogeneity and their features, including whether they capture SNVs, CNAs and structural variants (SVs), whether they support multiple biopsy analysis, and their key models and algorithms.

Most existing tools are designed to analyze a single tumor biopsy and are not suitable for jointly analyzing multiple biopsies. As DNA sequencing becomes more affordable, we can more easily collect multiple biopsies from a single patient during treatment. If we only perform heterogeneity analysis on the individual biopsies, then we are unable to detect clones that are shared across different biopsies from the same patient, and we fail to address important questions about how the tumor cells evolve, metastasize and escape treatment.

Finally, although most models are free and publicly available, it is difficult to extend them by adding new assumptions and new types of biological data. Even under the best of circumstances, significant effort is required for users to fully understand the source code. In many situations, data structures and computational algorithms prohibit other investigators from modifying the model to accommodate their special needs.

To address these challenges, we propose THEMIS (Tumor Heterogeneity Extensible Modeling via an Integrative System), which allows us to jointly characterize different types of genomic aberrations from multiple biopsies using a dynamic graphical model. We implement our model via an extensible modeling platform, the Graphical Models Toolkit (GMTK)8, which makes our approach open, reproducible and easy for others to extend. To extend the model, users only have to modify the model specification files; GMTK then automatically handles the required computation. Simulation experiments demonstrate that THEMIS significantly increases the accuracy of recovering tumor subclones and their genotypes, compared with its ancestor, TITAN9. Single cell DNA sequencing confirms that individual nuclei can be segregated into one of the two tumor subclones identified by THEMIS. We applied THEMIS to three tumor biopsies from one cancer patient, thereby revealing the mutation accumulation history of the patient, tracking cancer progression, and identifying mutations related to developing resistance following various treatments.

Results

The Model

From bulk next generation sequencing data, we define two primary observations at each genomic position: the allelic ratio, defined as the proportion of the reads containing a specified allele among all reads aligned to the site, and the log ratio between tumor read depth and normal read depth (Fig. 1a). From these inputs, we aim to infer the number of distinct clones, the full genotype of each clone, and the prevalence of each clone within each biopsy (Fig. 1b). To carry out this inference, TITAN9 uses a dynamic graphical model, in which each frame represents one genomic position, and the allelic ratio and tumor/normal log ratio are observed at each frame. The backbone of the TITAN model consists of two hidden Markov chains, one representing the genotype of the CNA event at the current position, and the other representing the clone in which the CNA event occurs. Our model, THEMIS, is similar to TITAN in the sense that both models are dynamic graphical models with each frame representing a single genomic position, with CNA events captured by hidden Markov chains. However, THEMIS extends TITAN by (1) jointly accounting for SNVs and CNAs, (2) jointly analyzing multiple biopsies, (3) estimating transition probabilities between hidden states of the model from observed data rather than fixing them at specific values, and (4) using an open and extensible modeling language (GMTK8). More details about the THEMIS model, including the modeling choices and assumptions, model’s structure, variables and parameters, are provided in Methods section.

Figure 1
figure 1

Example THEMIS input observations and the corresponding inferred outputs. (a) Inputs to THEMIS, including allelic ratio, log ratio and genomic position information. Somatic mutation sites are indicated by blue diamonds. (b) Outputs of THEMIS show that there are two tumor clones in the tumor biopsy, one parent tumor clone with 40% cell prevalence and one child tumor clone with 35% cell prevalence. The CNAs of the two tumor clones are visualized with color bars across the genome, and the genomic positions hosting an SNV are indicated by blue diamonds.

Simulation results

We first used simulated data to compare the performance of THEMIS and TITAN9. As a starting point for the simulation, we used the genomic positions measured in three tumor biopsies from three patients with triple negative breast cancer (Supplementary Table 1). For each set of genomic positions, we also specified three different sets of tumor subclone compositions. More details about the simulation experiments are provided in Supplementary Note 1. We evaluated (1) the percentage of sites at which the hidden genotype was incorrectly inferred, (2) the percentage of sites at which the clonal/subclonal status was incorrectly inferred, and (3) the percentage of sites at which either the genotype or clonal/subclonal status were incorrectly inferred. THEMIS outperformed TITAN in recovering the clonal/subclonal status and genotypes of the genomic positions in all experiments (two-sided paired t-test, p = 0.00137 for genotype recovery, p = 1.152 × 10−7) for clonal/subclonal status recovery, and p = 0.00198 for both genotype and clonal/subclonal status recovery), and reduced the recovery error by 13.3% on average (Supplementary Table 2). Not surprisingly, both THEMIS and TITAN performed better when the prevalence of the somatic events was higher.

Validation via single-cell DNA sequencing

The two tumor clones in Fig. 1b were identified from the bulk DNA sequencing data from an involved axillary node in a patient with metastatic triple negative breast cancer. We single-cell sequenced a second sample taken from the same axillary node at the same time, aiming to validate the subclones previously identified from bulk DNA sequencing. From a total number of 143,108 nuclei, 96 nuclei were fluorescence activated cell sorted (FACS) using gating to select for tumor cell nuclei, placed on a 96-well plate, and whole genome amplified (GenomiPhi). Indexed Nextera libraries were sequenced on the NextSeq using PE 150 bp mid-output flow cell. Six cells were removed due to extremely low numbers of reads after adapter removal (Supplementary Fig. 1). The sequencing data from another eleven nuclei was of low quality, as evidenced by a much larger fraction of short reads (Supplementary Fig. 2) and were excluded from our analysis. The remaining 79 cells were used for validation.

We use a Bayesian classifier (Supplementary Note 2) to assign each of the 79 cells into one of the three clones identified from the bulk sequencing data by THEMIS—one normal clone, one parent tumor clone and one child tumor clone. As input to the classifier, we use sequencing coverage on three types of genomic regions which are derived from the inferred genome-wide genotype for the two tumor clones (Fig. 1b), namely, clonal 2-copy (i.e. normal) regions, clonal 1-copy (i.e. loss of heterozygosity [LOH]) regions and subclonal 1-copy regions (Fig. 2b and Supplementary Table 3). Clonal 2-copy regions provide the baseline measurement of sequencing coverage in normal regions. Clonal LOH regions distinguish tumor from normal. Subclonal LOH regions distinguish the child tumor clone from its parent. The Bayesian classifier identifies 2 normal cells (indicating that FACS gating worked), 57 parent tumor cells and 20 child tumor cells (Fig. 2a and Supplementary Table 4). The observed ratio of parent to child tumor cells (2.85) does not agree with the ratio inferred by THEMIS (1.14), which is likely attributable to the very small number of examined events and that the single cell analysis was performed in a separate sample taken at the same time. However, the histograms of the normalized coverage rate in the 99 clonal LOH segments and 119 subclonal LOH segments in the three types of cells validate that the three categories of nuclei agree with our model (Fig. 2c). For example, cell 34 has no LOH events in the clonal LOH region nor in the subclonal LOH region, and is therefore a normal cell. Cell 26 displays LOH events in the clonal LOH region, but not in the subclonal LOH region, and is therefore identified as belonging to the parent tumor clone. Cell 1 has LOH events both in clonal and subclonal LOH regions, and is therefore identified as belonging to the child tumor clone. The histograms of the coverage rates from the aggregated 57 parent tumor cells show similar patterns to that of cell 26, whereas the histograms from the aggregated 20 child tumor cells show similar patterns as cell 1 (Supplementary Fig. 3). Therefore, our single cell experiment successfully validates the subclones identified by THEMIS.

Figure 2
figure 2

Results of single cell validation experiments. (a) Predicted cell category from the Bayesian classifier: normal cell (green), parent tumor cell (red), child tumor cells (purple) and unknown due to low sequencing quality (grey). (b) Three different types of regions that we use to distinguish the three types of cells, namely clonal 2-copy regions, clonal LOH regions and subclonal LOH regions. (c) Histograms of the relative coverage rate in clonal LOH segments and subclonal LOH segments demonstrate cell 34 is normal, cell 26 is from the parent tumor clone and cell 1 is from the child tumor clone. Red dotted lines in the histograms indicate the expected coverage rates in the cells.

Joint analysis over multiple biopsies from the ITOMIC study

The Intensive Trial of OMics In Cancer (ITOMIC-001) enrolls patients with metastatic triple negative breast cancer in whom biopsies of multiple metastatic sites are performed repeatedly over time10. For each patient, multiple biopsies are evaluated using next generation sequencing. We performed joint heterogeneity analysis on three biopsies from Patient 1 (Fig. 3a). Patient 1 was originally diagnosed with triple-negative breast cancer in February 2011, and enrolled in ITOMIC-001 in October 2013. The patient then received three different treatments, including cisplatin (between study days 12 and 125), an investigational PARP inhibitor veliparib (between study days 126 and 194), and the kinase inhibitor ponatinib (from study day 195 until the time of her death on study day 250). The first biopsy B1 was sampled from an involved right axillary lymph node, collected on study Day 7 (before cisplatin). The second biopsy B2 was sampled from the same right axillary lymph node on study Day 125 (after cisplatin). The third sample B3 was from a left peribronchial lymph node, collected at autopsy following the patients death on study day 250 (post ponatinib).

Figure 3
figure 3

Experiments and results from joint analysis of three biopsies from the same patient. (a) The collection of the three biopsies from the same patient during three stages of treatment and the inferred tumor clones in the biopsies. (b) The recovered phylogenetic tree and the mutations accumulated on each edge. The mutations are shown on a genome-wide plot, and the two numbers on each edge are the number of germline heterozygous sites affected by these CNAs and the number of SNVs. (c) Signal pathways (a simplified version of signal transduction pathways from Wikipedia), and the number of genes with copy number changes (copy gain and copy loss) in different stages of cancer progression. *Denotes at least one of the mutated genes is a core component of the signaling pathway.

We followed a five-step procedure to recover the phylogenetic tree from the three tumor biopsies (see Methods for details), which involves enumerating all possible candidate phylogenies from individual biopsy analysis and then selecting the best phylogeny by likelihood (Supplementary Note 3). The recovered phylogenetic tree (Fig. 3b) reveals the relationships among the recovered clones and the mutations accumulated at each stage of cancer progression. Parent clone A is shared by all three biopsies, and child clones (B, C, and D) inherit the mutations present in parent clone A and acquire new mutations of their own. Although clones B, C and D occur in separate biopsies, clones C and D appear to be descendants of clone B, meaning that the new mutations detected in later biopsies occurred within the child tumor clones. In addition to clones A, B, C and D, the phylogenetic tree includes an internal node for an inferred intermediate clone (CD) hosting the mutations shared between clone C and clone D, and corresponding to a splitting point between clones C and D in cancer progression. On the phylogenetic tree, we also label the new CNAs and SNVs on the edges. We visualize these mutations on a genome-wide plot, and provide the number of germline heterozygous sites affected by these CNAs and the number of the SNVs (Fig. 3b). The mutations on the edges of the phylogenetic tree reveal the mutation accumulation history of this patient and can help in tracking mutations related to treatment resistance. By the time the patient joined the study, there had been two major phases of mutation accumulation, one corresponding to the mutations accumulated in clone A and the other corresponding to mutations accumulated in clone B. Comparing the two phases, more SNVs emerged in the second phase. After joining the study and receiving further treatments, additional mutations emerged. The mutations on the edges between clone B and clone C emerged during treatment with cisplatin, whereas those on the edge between clone CD and clone D emerged during treatment with the PARP inhibitor veliparib (without response) followed by treatment with the kinase inhibitor ponatinib (which did yield a partial response). Because veliparib failed to affect tumor growth, we attribute the changes associated with the CD → D transition to ponatinib, which was given based on the presence of two activating mutations affecting FGFR2 (S252W;Y375C) (manuscript submitted).

A number of intriguing patterns were revealed when we looked at the mutations associated with different phases of treatment. First, many of the genes in CNA regions are known to be related to cancer. On the CNAs associated with the three treatments, including 147 CNAs during the treatment with cisplatin and 98 CNAs during the treatment with veliparib and ponatinib, we identify 848 genes and 519 genes, respectively. We queried these genes on the NCBI gene website (www.ncbi.nlm.nih.gov/gene) and retrieved 186 genes and 175 genes, respectively, that are known to be related to cancer in the literature. The retrieved genes are related to many important cancer signaling pathways (Fig. 3c). On some of the cancer signaling pathways (i.e., MEK/MAPK/Erk, PI3K/Akt/mTor, NF-κB, and p53), core genes (i.e., MAP2K7, PIK3C2A, PIK3CD, PIK3CB, TNFSF9, TNFSF14, NRAS, GSK3B, Notch3, TP53TG5 and SNAI1) experienced copy-number changes during different phases of treatment. In addition, the genes mutated during different phases of treatment show patterns that are potentially illustrative of different therapeutic responses. The proportion of genes experiencing copy number gains in later stages (i.e., on the edge CD → D, during the treatments with veliparib and ponatinib) is much higher than that in earlier stage (on the edges B → CD and CD → C, during the treatment with cisplatin), and the proportion difference is more dramatic in important cancer signaling pathways, including MEK/MAPK/Erk, PI3K/Akt/mTor and NF-κB. These genes mutated in the different phases of treatment also showed different functional focuses. According to DAVID11,12, the top function clusters among the genes mutated in the earlier stage are rho GTPase activation, growth factor, DNA damage and ErbB signaling. The top function clusters from the genes mutated in the later stage are DNA damage, ras signaling, nucleotide-binding, zinc-finger, neurotrophin signaling, and endocytosis. Third, a number of SNVs occurred on or near the genes known to be related to cancer signaling pathways, which allows us to investigate them together with the genes mutated due to copy number changes. During treatment with cisplatin, one SNV occurred near an intron/extron boundary within BIN1, which is known to interact with the myc oncoprotein as a putative tumor suppressor. During the treatments with veliparib and ponatinib, SNVs occurred on or near LATS1 (a core component of Hippo-YAP pathway), MGMT (related to DNA damage), IL17RB (related to NF-κB signaling) and APCDD1L (related to wnt signaling), all of which are known to be related to breast cancer. Although additional experiments are needed for further validation, THEMIS provides a powerful computational tool to generate hypotheses from multiple biopsy DNA sequencing data.

Discussion

THEMIS offers a powerful and extensible modeling framework to jointly capture different types of genomic aberrations in the analysis of multiple biopsies. The integration of CNAs and SNVs in the heterogeneity analysis increases the accuracy of clonal inference relative to previous methods that consider only single types of mutations. For example, if we observe an allelic ratio 0.3 at one genomic position, then the cell prevalence of the SNV should be 60% if the corresponding genotype is AB, but the prevalence should be 85.7% if the genotype is AAB. In such cases, methods that fail to jointly consider copy number information and SNVs can be misled. In addition, the integration of multiple types of mutations allows us to understand cancer comprehensively, and to address important questions such as how the different types of mutations cooperate with each other and what roles they play at different stages of cancer progression.

The joint analysis over multiple biopsies from the same patient provides a complete picture of mutation progression in the patient, which may shed light on how tumor cells escape treatment and metastasize. The ability to analyze multiple biopsies jointly will be increasingly important as DNA sequencing costs continue to decrease. The current turnaround time of analyzing the three biopsies with our model, including both data preprocessing and model running, is just a few hours.

Because THEMIS is built using a general purpose graphical models toolkit, the approach is easy to extend to alternative model architectures. For example, during review of this manuscript, one reviewer suggested that the THEMIS model likely over-segments the genome. We verified this effect empirically and then demonstrated the flexibility of the THEMIS framework by modifying the model to incorporate a user-specified constraint on the number of segments (Supplementary Note 4). In addition, GMTK provides flexible calculation in both estimation and inference, including both exact and approximate inference algorithms. Based on the available computing resources, the user can easily trade memory with running time. Using this modeling and algorithmic flexibility, we plan in future work to extend THEMIS to account for more complex types of mutations, such as chromothripsis and chromoplexy. We also plan to incorporate a more principled phylogeny reconstruction method into THEMIS. Ultimately, THEMIS will provide a testbed for model development by us and others interested in modeling the full complexity of tumor evolution.

Data availability

The bulk DNA sequencing data and the single cell DNA sequencing data used in our analysis can be downloaded from Sequence Read Archive with accession SRP102304.

Software availability

THEMIS is available at https://github.com/jieliu6/THEMIS.

Methods

Data preprocessing

We assume that next generation sequencing data was mapped to the reference genome, and the mapped BAM files are ready for analysis. Pre-processing of the data consists of three steps (Supplementary Fig. 4). First, we identify the genomic sites that will be included in the model. Our model captures both CNA events and SNV events; therefore, two types of genomic sites are included. For CNAs, we consider germline heterozygous sites since we can monitor not only absolute copy number changes (via tumor-normal read depth difference), but also what happens to the two individual copies (via allelic imbalance). For SNVs, we consider the somatic mutation sites which host an SNV event in any of the tumor biopsies. From the germline (normal) BAM file, we use Samtools to identify germline heterozygous sites. From the tumor and normal BAM files, we use MuTect13 to identify somatic mutation sites. Second, we filter out unreliable sites and reads using MuTect13. Third, we adjust for GC content and mappability. Short reads from next generation sequencers are not uniformly distributed across the genome—more reads are expected to be obtained from regions with higher GC content and mappability. The bias cannot fully be adjusted by normalizing with another next generation sequencing library (e.g. from a normal biopsy) from the same patient14. We therefore use HMMcopy15 to adjust GC content and mappability in the read counts.

Modeling choices in THEMIS

Unlike previous methods such as phyloWGS5, SPRUCE6 and Canopy7, which capture CNA or SNV events as the entities in the model, our model THEMIS and its predecessor TITAN directly model individual genomic positions as the entities in the model and therefore have the ability to perform CNA calling during tumor heterogeneity analysis. Both THEMIS and TITAN are dynamic graphical models with each frame representing a single genomic position, with CNA events captured by hidden Markov chains. Therefore, THEMIS inherits five key assumptions from TITAN:

  1. 1.

    Two primary observed variables—allelic imbalance and the tumor-normal read depth ratio—reflect the underlying somatic genotype of the tumor at germline heterozygous sites.

  2. 2.

    CNA events span multiple contiguous germline heterozygous sites.

  3. 3.

    The observed NGS data comes from heterogeneous cellular populations, including normal cells and tumor subpopulations.

  4. 4.

    Two mutation events are observed at the same cellular prevalence if and only if the two events come from the same subpopulation.

  5. 5.

    Only one CNA event can arise in only one tumor subpopulation at each genomic position.

Note that Assumption 4, although used by many tumor heterogeneity models, can be invalid if two different tumor subclones in a tumor have the same cellular prevalence. The purpose of introducing Assumption 5 is to make the heterogeneity model simple and identifiable; however, this assumption does prevent us from modeling more complicated situations in which multiple CNAs arise in the same genomic region.

We usually have around 30–50 thousand germline heterozygous sites and several hundred somatic mutation sites in whole-exome sequencing data from a single biopsy. With reasonable sequencing depth (greater than 100 reads per position, on average) the underlying genotypes (i.e. the type of the CNA event) estimated from the contiguous germline heterozygous sites can be inferred accurately. Integrating the somatic mutation sites and germline heterozygous sites using two factorial Markov chains allows us to model sites that harbor both a CNA event and a somatic mutation. In the situation when the observed variables at one somatic mutation site suggest that the genotype or the subclone assignment at that site disagrees with the neighboring germline heterozygous sites, THEMIS can still infer the correct hidden genotype and subclone assignment based on the observed variables at the somatic mutation site. Furthermore, because there will typically be many contiguous germline heterozygous sites before and after this somatic mutation site, the disagreement will not be propagated to nearby germline heterozygous sites.

We adopted these particular modeling choices and assumptions based on the sequencing quality and depth in our data. However, we encourage users to adjust these modeling choices and assumptions as appropriate for their own data. The extensible modeling platform employed by THEMIS should make it easy to implement variants of the model proposed here.

Structure of the THEMIS model

The THEMIS dynamic graphical model can be represented using a standard “plate” representation. In Fig. 4, the “Prologue” represents the start of the model, and the “Chunk” represents all random variables associated with a single genomic position. In practice, the chunk is copied multiple times so that the length of the model matches the length of the observed data (i.e., the total number of genomic positions). In the figure, each vertex represents one random variable at a particular genomic position. If a vertex is shaded, then the corresponding random variable is observed; otherwise, the random variable is hidden. The variables and parameters used in the model are explained in the next two sections and summarized in Supplementary Table 5. We use capital letters to denote random variables and the corresponding lower-case letters to denote the particular values of the random variables. If a lower case letter has a bar on top, then it is observed; otherwise, it is inferred.

Figure 4
figure 4

The THEMIS model.

The backbone of the model consists of two Markov chains, i.e., two hidden variables at each site t, corresponding to the unknown genotype of the mutation (G t ) and an indicator (Z t ) of which clone the mutation occurs in. The two Markov chains capture the phenomenon that CNA events span multiple contiguous genomic positions, and the corresponding most probable states that we infer from the observed variables are essentially the output of our THEMIS model. At site t and in biopsy m, a set of observed variables represent the allelic ratio (A m,t ) and the log ratio between tumor read depth and normal read depth (L m,t ). Other useful information about the genomic position is also captured in the model as observed variables, including the type of site (D t ), an indicator variable for the first site of a chromosome (S t ), and the distance from its previous site (H t ). For each biopsy m, the model contains an additional set of |Z| hidden variables, \({P}_{m}^{1},{P}_{m}^{2},\ldots ,{P}_{m}^{|Z|}\), denoting the prevalence levels of the clones.

The conditional independence relationships among the variables are encoded by the edges either within a genomic position or between two adjacent genomic positions. At each site, the model specifies the probability of the observed variables given the hidden variables, which captures how different genotypes and the occurrence in different clones, in combination, make the allelic ratio and log ratio different in tumor biopsies. At a germline heterozygous site (where D t  = 0), the allelic ratio reflects how the allelic ratio is different from 0.5, which is expected in a normal cell. At a somatic mutation site (where D t  = 1), the allelic ratio reflects how the allelic ratio is different from 0, which is expected in a normal cell. Therefore, the parents of the allelic ratio (A m,t ) include the genotype (G t ), the clone index (Z t ), the prevalence levels (\({P}_{m}^{1},{P}_{m}^{2},\ldots ,{P}_{m}^{|Z|}\)) and the type of site (D t ). Because the log ratio of the tumor-normal read depth difference (L m,t ) does not depend on the type of site, its parents include the genotype (G t ), the clone index (Z t ), and the prevalence levels (\({P}_{m}^{1},{P}_{m}^{2},\ldots ,{P}_{m}^{|Z|}\)). Between any two adjacent sites, we specify the transition probability between genotypes and the transition probability between clones. The variable H t is the distance (in base pairs) between site t and its previous site t − 1. We set transition parameters (for both G t and Z t ) as functions of \({\bar{h}}_{t}\) (the observed value of H t ) to capture the phenomenon that the chance G t and G t − 1 agree decreases as \({\bar{h}}_{t}\) increases (and similarly for Z t and Z t − 1). Therefore, H t is a parent of G t and Z t at a non-start-of-a-chromosome site t. When the current site is the start of a chromosome (S t = 1), G t and Z t do not depend on G t − 1 and Z t − 1, but follow prior distributions (π G and π Z ). Therefore, S t is also a parent of G t and Z t .

Variables in the THEMIS model

The variables in the THEMIS model are either observed variables or hidden variables. The observed variables are directly obtained from the data, whereas the most probable states of the hidden variables must be inferred, given the observed variables and the trained parameters. Each frame of the THEMIS model contains five observed variables. Two of these are key signals to detect mutations, and they are modeled as Gaussians:

  • The allelic ratio A m,t at site t in biopsy m (m = 1, …, M) is modeled as a Gaussian variable, i.e.,

    $$P({A}_{m,t}=a|{Z}_{t}=z,{G}_{t}=g,{P}_{m}^{z}={p}_{m}^{z},{D}_{t}=\bar{d})={\mathscr{N}}(a;{\mu }_{A,m,g,z,{p}_{m}^{z}},{\sigma }_{A,m,\bar{d}}^{2})\mathrm{.}$$
    (1)

    The mean parameter \({\mu }_{A,m,g,z,{p}_{m}^{z}}\) associated with this Gaussian is set to

    $${\mu }_{A,m,g,z,{p}_{m}^{z}}=\frac{{n}_{g}^{alt}{p}_{m}^{z}+{n}_{N}^{alt}\mathrm{(1}-{p}_{m}^{z})}{{n}_{g}{p}_{m}^{z}+{n}_{N}\mathrm{(1}-{p}_{m}^{z})},$$
    (2)

    where n g is the DNA copy number in tumor cells with genotype g, n N is the DNA copy number in normal cells, \({n}_{g}^{alt}\) is the copy number of the alternative allele in tumor cells with genotype g, and \({n}_{N}^{alt}\) is the copy number of the alternative allele in normal cells (Supplementary Table 6). The mean parameter \({\mu }_{A,m,g,z,{p}_{m}^{z}}\) is not estimated from data, but determined by the states of the hidden variables G t , Z t , and \({P}_{m}^{{Z}_{t}}\) and the observed \({\bar{d}}_{t}\). The variance parameter \({\sigma }_{A,m,\bar{d}}^{2}\), however, is estimated from data.

  • The log ratio of tumor-normal read depth at site t in biopsy m (m = 1, …, M), denoted by L m,t , is modeled as a Gaussian variable, i.e.,

    $$P({L}_{m,t}=l|{Z}_{t}=z,{G}_{t}=g,{P}_{m}^{z}={p}_{m}^{z})={\mathscr{N}}(l;{\mu }_{L,m,g,z,{p}_{m}^{z}},{\sigma }_{L,m}^{2})\mathrm{.}$$
    (3)

    The mean parameter \({\mu }_{L,m,g,z,{p}_{m}^{z}}\) is set to

    $${\mu }_{L,m,g,z,{p}_{m}^{z}}={\mathrm{log}}_{{\rm{2}}}\frac{{n}_{g}{p}_{m}^{z}+{n}_{N}\mathrm{(1}-{p}_{m}^{z})}{{n}_{N}}+{c}_{m},$$
    (4)

    where n N is copy number in normal cells (set to be 2 by default), and n g is the DNA copy number in tumor cells with genotype g. The parameter c m captures the sequencing depth difference in the tumor biopsy and the normal biopsy and the read number discrepancy due to ploidy change in the tumor biopsy. Therefore, the mean parameter \({\mu }_{L,m,g,z,{p}_{m}^{z}}\) is also not estimated from data, but determined by the states of the hidden variables G t , Z t , and \({P}_{m}^{{Z}_{t}}\). The variance parameter \({\sigma }_{L,m}^{2}\) again is estimated from data.

    The remaining three observed variables provide information about the current genomic position, and they are discrete:

  • The variable H t is the distance (in base pairs) between site t and site t − 1. The effect of H t is interesting since, while there is an equal number of graphical model frames between any two sites, the actual duration between them still effects the statistics of the underlying Markov chains since both G t and Z t directly depend on H t . THEMIS, therefore, expresses a kind of irregularly spaced dynamic graphical model within the frame work of a regularly spaced dynamic graphical model. Moreover, THEMIS does this more efficiently than an alternative where the number of graphical model frames between sites t and t − 1 is proportional to \({\bar{h}}_{t}\), an approach that would be significantly more costly computationally. Supplementary Fig. 5 shows a histogram of \(\mathrm{log}({\bar{h}}_{t}))\) indicating that there is a diverse set of lengths between successive sites — the diversity suggests that inter-site length can have a significant influence on the Markov chains’ transition matrices.

  • The site type D t at site t is a Boolean with D t = 0 denoting a germline heterozygous site and D t = 1 denoting a somatic mutation site which hosts a SNV event.

  • The Boolean variable S t denotes the start of a chromosome. When the current site is the start of a chromosome (S t = 1), G t and Z t do not depend on G t − 1 and Z t − 1, but follow uniform prior distributions.

In addition, each frame of the THEMIS model contains two hidden variables.

  • The genotype G t at site t is a discrete variable, which corresponds to all possible genotypes up to a certain number of copies. We consider all possible genotypes up to five copies (Supplementary Table 6).

  • The clone index variable Z t at site t is a discrete variable of |Z| possible values (i.e., Z t {1, …, |Z|}), where |Z| is pre-specified by the user.

Finally, the model contains a set of hidden variables that are “tied” across frames. For clone z in biopsy m (m = 1, …, M), the prevalence level variable \({P}_{m}^{z}\) is a discrete variable of |P| possible values, where |P| is pre-specified by the user. The default |P| is 20, corresponding to 20 equally spaced prevalence levels between 0 and 1 (i.e. 0.05, 0.10, …, and 1.00). THEMIS does not model clone prevalence as a continuous variable because clone prevalence is a parent of other variables (e.g. allelic ratio), and a hidden continuous variable cannot appear as a parent of other variables in a dynamic graphical model.

Parameters in the THEMIS model

Some parameters in the THEMIS model need to be specified by the user, whereas other parameters are estimated from data. Specifically, the user must specify the following parameters: the number of biopsies used in the analysis (M), the number of subclones (|Z|), the maximum copy number in the mutations (\({c}_{max}^{T}\)), the number of prevalence levels (|P|), and the log-ratio offset in biopsy m due to ploidy and sequencing depth change (c m , m = 1, …, M). In our experiments, we first run THEMIS with an initial estimate of c m derived by examining the bivariate plot of allelic ratio and log ratio at germline heterozygous sites (Supplementary Fig. 6). Specifically, c m is estimated by identifying the center of the normal genotype cluster on the log ratio axis. After running THEMIS, we re-estimate c m as the average log-ratio on the sites whose genotypes are predicted to be “AB” (i.e. no CNA). This new estimate is used in a second run of THEMIS. In practice, the user can also leverage other ploidy estimation tools to get the initial estimate of c m or run THEMIS multiple times with multiple initial estimates and choose with the one with highest likelihood. Three sets of parameters are estimated from data via the standard expectation-maximization (EM) algorithm for dynamic graphical models:

  1. 1.

    The variance of allelic ratio in biopsy m (m = 1, …, M) on germline heterozygous sites, denoted by \({\sigma }_{A,m\mathrm{,0}}^{2}\), the variance of allelic ratio in biopsy m (m = 1, …, M) on somatic mutation sites, denoted by \({\sigma }_{A,m\mathrm{,1}}^{2}\), and the variance of log ratio in biopsy m (m = 1, …, M) on each site, denoted by \({\sigma }_{L,m}^{2}\).

  2. 2.

    The transition probability from genotype j at site t − 1 to genotype i at site t (i, j {1, …, |G|), denoted by \({Q}_{G}(i,j;{\bar{h}}_{t})\) and transition probability from clone j at site t − 1 to clone i at site t (i, j {1, …, |Z|), denoted by \({Q}_{Z}(i,j;{\bar{h}}_{t})\). We model \({Q}_{G}(i,j;{\bar{h}}_{t})\) and \({Q}_{Z}(i,j;{\bar{h}}_{t})\) as parametric functions of \({\bar{h}}_{t}\), the distance (in base pairs) between site t and site t − 1. We first define the probability of staying at the same genotype j, denoted by \({\rho }_{G}(j;{\bar{h}}_{t})\), as

    $${\rho }_{G}(j;{\bar{h}}_{t})=\frac{{\mathscr{N}}(\mathrm{log}\,{\bar{h}}_{t};0,{\sigma }_{G,j}^{2})}{{\mathscr{N}}\mathrm{(0};0,{\sigma }_{G,j}^{2})}\frac{|G|-1}{|G|}+\frac{1}{|G|},$$
    (5)

    where \({\mathscr{N}}(x;0,{\sigma }_{G,j}^{2})\) is the probability density of a Gaussian distribution with mean 0 and variance \({\sigma }_{G,j}^{2}\) at the point x. Then

    $${Q}_{G}(i,j;{\bar{h}}_{t})=(\begin{array}{ll}{\rho }_{G}(j;{\bar{h}}_{t}), & \,\mathrm{if}\,\,i=j\\ \frac{1-{\rho }_{G}(j;{\bar{h}}_{t})}{|G|-1}, & \,{\rm{otherwise}}\mathrm{.}\end{array}$$
    (6)

    Similarly, we define

    $${\rho }_{Z}(j;{\bar{h}}_{t})=\frac{{\mathscr{N}}(\mathrm{log}\,{\bar{h}}_{t};0,{\sigma }_{Z,j}^{2})}{{\mathscr{N}}\mathrm{(0};0,{\sigma }_{Z,j}^{2})}\frac{|Z|-1}{|Z|}+\frac{1}{|Z|},$$
    (7)

    and

    $${Q}_{Z}(i,j;{\bar{h}}_{t})=(\begin{array}{cc}{\rho }_{Z}(j;{\bar{h}}_{t}), & \,\mathrm{if}\,i=j\\ \frac{1-{\rho }_{Z}(j;{\bar{h}}_{t})}{|Z|-1}, & \,{\rm{otherwise}}\,\mathrm{.}\end{array}$$
    (8)

    Note that we estimate \({\sigma }_{G,j}^{2}\) (j = 1, …, |G|) and \({\sigma }_{Z,j}^{2}\) (j = 1, …, |Z|) from data in a maximum likelihood fashion. The way we parameterize the transition probabilities captures the phenomenon that the probability of staying at the same genotype decreases as a monotone function of the distance from the previous site, and the rate it decreases is parametrized by \({\sigma }_{G,j}^{2}\). Therefore, we estimate the decreasing speed (i.e. \({\sigma }_{G,j}^{2}\)) adaptively from the data (unlike user-prespecified transition probabilities in TITAN), and the speed is different for different genotypes since different mutation events may occur with different lengths on the genome (unlike the tied transition probabilities in TITAN).

  3. 3.

    The prior distributions of genotypes, clones and the cell prevalence levels of the clones (denoted by π G , π Z and π P , respectively). These prior distributions are responsible for the frames that correspond to the starts of the chromosomes. These prior distributions are initialized as uniform distributions, and trained from the data along with other parameters.

We use Ω to denote the set of parameters in the joint distribution specified by the model, namely \({\rm{\Omega }}=\{{\sigma }_{A,m\mathrm{,0}}^{2},{\sigma }_{A,m\mathrm{,1}}^{2},{\sigma }_{L,m}^{2},{\sigma }_{G,j}^{2},{\sigma }_{Z,j}^{2},{\pi }_{G},{\pi }_{Z},{\pi }_{P},\forall m=\mathrm{1,}\ldots ,M,\forall j=\mathrm{1,}\ldots ,|Z|\}\). In the estimation step, we use the EM algorithm to estimate the parameters and find a (local) maximum, denoted by \(\hat{{\rm{\Omega }}}\), for

$$\mathop{{\rm{argmax}}}\limits_{{\rm{\Omega }}}{\rm{\Pr }}({A}_{\mathrm{1:}M\mathrm{,1:}T}={\bar{a}}_{\mathrm{1:}M\mathrm{,1:}T},{L}_{\mathrm{1:}M\mathrm{,1:}T}={\bar{\ell }}_{\mathrm{1:}M\mathrm{,1:}T},{D}_{\mathrm{1:}T}={\bar{d}}_{\mathrm{1:}T},{S}_{\mathrm{1:}T}={\bar{S}}_{\mathrm{1:}T},{H}_{\mathrm{2:}T}={\bar{h}}_{\mathrm{2:}T};{\rm{\Omega }}\mathrm{).}$$
(9)

In the inference step, we infer the most probable states of the hidden variables given the estimated parameters \(\hat{{\rm{\Omega }}}\), i.e.,

$$\begin{array}{c}\mathop{{\rm{argmax}}}\limits_{{g}_{\mathrm{1:}T},{z}_{\mathrm{1:}T},{p}_{\mathrm{1:}T}^{z}}{\rm{\Pr }}({A}_{\mathrm{1:}M\mathrm{,1:}T}={\bar{a}}_{\mathrm{1:}M\mathrm{,1:}T},{L}_{\mathrm{1:}M\mathrm{,1:}T}={\bar{\ell }}_{\mathrm{1:}M\mathrm{,1:}T},{D}_{\mathrm{1:}T}={\bar{d}}_{\mathrm{1:}T},{S}_{\mathrm{1:}T}={\bar{S}}_{\mathrm{1:}T},\\ \quad \quad \quad \quad \quad \quad {H}_{\mathrm{2:}T}={\bar{h}}_{\mathrm{2:}T},{G}_{\mathrm{1:}T}={g}_{\mathrm{1:}T},{Z}_{\mathrm{1:}T}={z}_{\mathrm{1:}T},{P}_{\mathrm{1:}M\mathrm{,1:}T}^{z}={p}_{\mathrm{1:}M\mathrm{,1:}T}^{z};\hat{{\rm{\Omega }}}\mathrm{).}\end{array}$$
(10)

The inferred most probable sequence of hidden variables, as the output of our algorithm, provide the heterogeneity analysis results, i.e., a number of subclones and their genotypes and cell prevalences (Fig. 1b).

Selecting the number of subclones

THEMIS requires the user to specify the number of subclones in the biopsies before running the model. There are three ways of identifying the number of subclones from the data. The first method is a naive visualization method. If the biopsy is well sequenced, then the number of subclones can be directly identified from the bivariate plot (allelic ratio against log ratio at germline heterozygous sites) of the biopsy by observing different prevalence levels of the LOH events. We take one tumor biopsy (biopsy B1 in ITOMIC study) as an example, whose bivariate plot is provided in Supplementary Fig. 6. It can be observed that there are two major LOH prevalence levels in the plot. Therefore, we can assume that there are two tumor subclones in the biopsy.

Another way of choosing the number of subclones is to use the Bayesian information criterion (BIC)16. BIC is defined as

$$BIC=\,-2\,\mathrm{ln}\,L+k\,\mathrm{ln}(n),$$
(11)

where ln L is the log likelihood of the data, k is the degree of freedom, and n is the number of data points. We choose the subclone number which produces the smallest BIC score. When we run the 2-subclone model and the 3-subclone model on the tumor biopsy B1, BIC scores are −476,725 and −464,710, respectively. Therefore, we can assume there are two tumor subclones in the biopsy based on the BIC scores.

A third way of choosing the number of subclones is to use cross-validation. Suppose that we use three-fold cross-validation. We randomly partition the chromosomes into three sets. In each training-testing split, we use the data from two sets as the training data and the remaining set as the testing data. With the estimated parameters from training data, we run Viterbi algorithm on the testing data, and choose the subclone number which produces the largest averaged log-likelihood (a.k.a., the Viterbi score in GMTK) on testing data. In both simulated data (in Simulations 1 and 2, we simulated two tumor subclones and three tumor subclones, respectively) and real data (biopsy B1 in the ITOMIC study), we observed that the three methods provide the correct results (Supplementary Table 7).

In practice, one may use any of the three methods or a combination of the three methods to set the number of subclones. Note that although the naive visualization method is straightforward, it may produce inaccurate estimates if the sequencing depth of the biopsy is low or when the prevalence levels of two subclones are close to each other. Cross-validation is more robust compared with BIC, but requires additional computational cost. When BIC and cross-validation are being used, we recommend starting with a small number of subclones (e.g. 2) and increase the number until the evaluation criteria deteriorate. For example, if a 2-subclone model produces a lower BIC (or higher averaged ln L in cross-validation) than a 3-subclone model, it is not necessary to run the 4-subclone model.

Joint analysis of multiple biopsies from the same patient

The joint analysis of multiple biopsies in THEMIS is done by first enumerating candidate phylogenetic trees, encoding each tree in the conditional probability tables associated with variables A m,t and L m,t , and then selecting the tree whose associated model yields the highest likelihood. During the enumeration phase, we make three assumptions. First, we assume that we have the statistical power to discern all the clones from individual biopsies and estimate their prevalences. Second, we assume that we can identify shared clones between biopsies by computing and thresholding similarities between the clones. Third, if the sum of the prevalences (p a and p b ) of clones a and b is greater than 1.0 in at least one biopsy, and p a  > p b in all biopsies where clones a and b are present, then we assume that clone a is an ancestor of clone b. The first two assumptions ensure that the ground truth structure is contained in the candidate structures. The third assumption helps us reduce the number of candidate structures. Users are also encouraged to use other information, such as the time and physical locations of the biopsies, to eliminate candidate structures. Joint analysis over multiple biopsies can be carried out in the following five steps.

  1. 1.

    Analyze biopsies separately with THEMIS and identify the genotype and prevalence of each clone within each biopsy.

  2. 2.

    Compute similarities between all pairs of clones from different biopsies and merge similar clones.

  3. 3.

    Identify consistent parent-child relationships based on the individually estimated prevalences using the third assumption above.

  4. 4.

    Enumerate all phylogenies consistent with those relationships and run THEMIS accordingly.

  5. 5.

    Select the phylogeny with maximum likelihood.