Abstract
A comprehensive characterization of tumor genetic heterogeneity is critical for understanding how cancers evolve and escape treatment. Although many algorithms have been developed for capturing tumor heterogeneity, they are designed for analyzing either a single type of genomic aberration or individual biopsies. Here we present THEMIS (Tumor Heterogeneity Extensible Modeling via an Integrative System), which allows for the joint analysis of different types of genomic aberrations from multiple biopsies taken from the same patient, using a dynamic graphical model. Simulation experiments demonstrate higher accuracy of THEMIS over its ancestor, TITAN. The heterogeneity analysis results from THEMIS are validated with single cell DNA sequencing from a clinical tumor biopsy. When THEMIS is used to analyze tumor heterogeneity among multiple biopsies from the same patient, it helps to reveal the mutation accumulation history, track cancer progression, and identify the mutations related to treatment resistance. We implement our model via an extensible modeling platform, which makes our approach open, reproducible, and easy for others to extend.
Introduction
Cancer is heterogeneous in the sense that the cancer cells in a tumor are not genetically identical, but form distinct clones, defined as subpopulations of cancer cells that host the same genomic aberrations. In aggressive and metastatic cancers, these genomic aberrations quickly evolve, resulting in extreme spatial and temporal heterogeneity^{1,2}. Therefore, multiple biopsies over different locations and at different time points need to be collected and sequenced in order to capture the complexity of tumor genomic landscapes and provide insight into how tumors evolve and escape treatment^{3,4}. Accordingly, computational tools are needed to accurately characterize the clonal structure of cancer and reveal how that structure evolves over time.
In recent years, a large number of computational tools and statistical models have been developed to analyze tumor heterogeneity from DNA sequencing data (Table 1). However, most of these tools only model one type of genomic aberration, such as single-nucleotide variants (SNVs), copy number alterations (CNAs), or structural variants. Restricting the analysis to a single type of mutation not only reduces statistical power to accurately detect the clonal structure within the tumor, but also prevents us from understanding interactions among different types of mutations. Furthermore, many SNV-based methods assume that no copy number changes have occurred, which is extremely improbable. Therefore, their estimation of the prevalence of a given clone can be inaccurate, and the corresponding heterogeneity results may be misleading. Existing methods that capture SNVs and CNAs in the same model (i.e., phyloWGS^{5}, SPRUCE^{6} and Canopy^{7}) require running a CNA-calling algorithm before heterogeneity analysis, but accurate CNA characterization also depends on heterogeneity analysis.
Most existing tools are designed to analyze a single tumor biopsy and are not suitable for jointly analyzing multiple biopsies. As DNA sequencing becomes more affordable, we can more easily collect multiple biopsies from a single patient during treatment. If we only perform heterogeneity analysis on the individual biopsies, then we are unable to detect clones that are shared across different biopsies from the same patient, and we fail to address important questions about how the tumor cells evolve, metastasize and escape treatment.
Finally, although most models are free and publicly available, it is difficult to extend them by adding new assumptions and new types of biological data. Even under the best of circumstances, significant effort is required for users to fully understand the source code. In many situations, data structures and computational algorithms prohibit other investigators from modifying the model to accommodate their special needs.
To address these challenges, we propose THEMIS (Tumor Heterogeneity Extensible Modeling via an Integrative System), which allows us to jointly characterize different types of genomic aberrations from multiple biopsies using a dynamic graphical model. We implement our model via an extensible modeling platform, the Graphical Models Toolkit (GMTK)^{8}, which makes our approach open, reproducible and easy for others to extend. To extend the model, users only have to modify the model specification files; GMTK then automatically handles the required computation. Simulation experiments demonstrate that THEMIS significantly increases the accuracy of recovering tumor subclones and their genotypes, compared with its ancestor, TITAN^{9}. Single cell DNA sequencing confirms that individual nuclei can be segregated into one of the two tumor subclones identified by THEMIS. We applied THEMIS to three tumor biopsies from one cancer patient, thereby revealing the mutation accumulation history of the patient, tracking cancer progression, and identifying mutations related to developing resistance following various treatments.
Results
The Model
From bulk next generation sequencing data, we define two primary observations at each genomic position: the allelic ratio, defined as the proportion of the reads containing a specified allele among all reads aligned to the site, and the log ratio between tumor read depth and normal read depth (Fig. 1a). From these inputs, we aim to infer the number of distinct clones, the full genotype of each clone, and the prevalence of each clone within each biopsy (Fig. 1b). To carry out this inference, TITAN^{9} uses a dynamic graphical model, in which each frame represents one genomic position, and the allelic ratio and tumor/normal log ratio are observed at each frame. The backbone of the TITAN model consists of two hidden Markov chains, one representing the genotype of the CNA event at the current position, and the other representing the clone in which the CNA event occurs. Our model, THEMIS, is similar to TITAN in the sense that both models are dynamic graphical models with each frame representing a single genomic position, with CNA events captured by hidden Markov chains. However, THEMIS extends TITAN by (1) jointly accounting for SNVs and CNAs, (2) jointly analyzing multiple biopsies, (3) estimating transition probabilities between hidden states of the model from observed data rather than fixing them at specific values, and (4) using an open and extensible modeling language (GMTK^{8}). More details about the THEMIS model, including the modeling choices and assumptions, the model's structure, and its variables and parameters, are provided in the Methods section.
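As a minimal illustration of the two observations defined above, the following Python sketch computes them from raw counts at a single site. The function names are hypothetical and are not part of THEMIS or TITAN; the base-2 logarithm is assumed from the log-ratio mean given in Methods, and the GC-content and mappability corrections applied during preprocessing are omitted here.

```python
import math


def allelic_ratio(allele_reads: int, total_reads: int) -> float:
    """Proportion of reads at a site that carry the specified allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return allele_reads / total_reads


def depth_log_ratio(tumor_depth: float, normal_depth: float) -> float:
    """log2 of tumor read depth over normal read depth at a site.

    Assumption: depths are already comparable between the two libraries;
    no GC-content or mappability correction is applied in this sketch.
    """
    if tumor_depth <= 0 or normal_depth <= 0:
        raise ValueError("depths must be positive")
    return math.log2(tumor_depth / normal_depth)


# Example: 30 of 100 reads carry the allele; tumor depth is half of normal.
example_ratio = allelic_ratio(30, 100)
example_log_ratio = depth_log_ratio(50, 100)
```

In this example the allelic ratio is 0.3 and the log ratio is −1, the latter corresponding to a halving of read depth in the tumor relative to the normal sample.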
Simulation results
We first used simulated data to compare the performance of THEMIS and TITAN^{9}. As a starting point for the simulation, we used the genomic positions measured in three tumor biopsies from three patients with triple negative breast cancer (Supplementary Table 1). For each set of genomic positions, we also specified three different sets of tumor subclone compositions. More details about the simulation experiments are provided in Supplementary Note 1. We evaluated (1) the percentage of sites at which the hidden genotype was incorrectly inferred, (2) the percentage of sites at which the clonal/subclonal status was incorrectly inferred, and (3) the percentage of sites at which either the genotype or clonal/subclonal status was incorrectly inferred. THEMIS outperformed TITAN in recovering the clonal/subclonal status and genotypes of the genomic positions in all experiments (two-sided paired t-test, p = 0.00137 for genotype recovery, p = 1.152 × 10^{−7} for clonal/subclonal status recovery, and p = 0.00198 for both genotype and clonal/subclonal status recovery), and reduced the recovery error by 13.3% on average (Supplementary Table 2). Not surprisingly, both THEMIS and TITAN performed better when the prevalence of the somatic events was higher.
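The three error percentages used in this evaluation can be sketched as follows. This is an illustrative Python snippet, not the evaluation code used in the study; the function name and the list-based input format (one true and one inferred label per site) are assumptions.

```python
def recovery_errors(true_genotypes, inferred_genotypes,
                    true_status, inferred_status):
    """Return the percentage of sites with (1) a wrong genotype,
    (2) a wrong clonal/subclonal status, and (3) either one wrong."""
    n_sites = len(true_genotypes)
    if n_sites == 0 or not (
        len(inferred_genotypes) == len(true_status)
        == len(inferred_status) == n_sites
    ):
        raise ValueError("inputs must be non-empty and of equal length")

    genotype_wrong = [t != p for t, p in zip(true_genotypes, inferred_genotypes)]
    status_wrong = [t != p for t, p in zip(true_status, inferred_status)]
    either_wrong = [g or s for g, s in zip(genotype_wrong, status_wrong)]

    def to_percent(flags):
        return 100.0 * sum(flags) / n_sites

    return (to_percent(genotype_wrong),
            to_percent(status_wrong),
            to_percent(either_wrong))


# Toy example with four sites (hypothetical labels).
errors = recovery_errors(
    ["AB", "A", "AAB", "AB"], ["AB", "A", "AB", "AB"],
    ["clonal", "subclonal", "clonal", "clonal"],
    ["clonal", "clonal", "clonal", "clonal"],
)
```

By construction, the third percentage is at least as large as each of the first two, because a site counts as an error if either label is wrong.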
Validation via single-cell DNA sequencing
The two tumor clones in Fig. 1b were identified from the bulk DNA sequencing data from an involved axillary node in a patient with metastatic triple negative breast cancer. We single-cell sequenced a second sample taken from the same axillary node at the same time, aiming to validate the subclones previously identified from bulk DNA sequencing. From a total of 143,108 nuclei, 96 nuclei were selected by fluorescence-activated cell sorting (FACS), using gating to select for tumor cell nuclei, placed on a 96-well plate, and whole genome amplified (GenomiPhi). Indexed Nextera libraries were sequenced on the NextSeq using a PE 150 bp mid-output flow cell. Six cells were removed due to extremely low numbers of reads after adapter removal (Supplementary Fig. 1). The sequencing data from another eleven nuclei were of low quality, as evidenced by a much larger fraction of short reads (Supplementary Fig. 2), and these nuclei were excluded from our analysis. The remaining 79 cells were used for validation.
We use a Bayesian classifier (Supplementary Note 2) to assign each of the 79 cells to one of the three clones identified from the bulk sequencing data by THEMIS: one normal clone, one parent tumor clone and one child tumor clone. As input to the classifier, we use sequencing coverage on three types of genomic regions, which are derived from the inferred genome-wide genotype for the two tumor clones (Fig. 1b), namely, clonal 2-copy (i.e. normal) regions, clonal 1-copy (i.e. loss of heterozygosity [LOH]) regions and subclonal 1-copy regions (Fig. 2b and Supplementary Table 3). Clonal 2-copy regions provide the baseline measurement of sequencing coverage in normal regions. Clonal LOH regions distinguish tumor from normal. Subclonal LOH regions distinguish the child tumor clone from its parent. The Bayesian classifier identifies 2 normal cells (indicating that FACS gating worked), 57 parent tumor cells and 20 child tumor cells (Fig. 2a and Supplementary Table 4). The observed ratio of parent to child tumor cells (2.85) does not agree with the ratio inferred by THEMIS (1.14), which is likely attributable to the very small number of examined events and to the fact that the single cell analysis was performed on a separate sample taken at the same time. However, the histograms of the normalized coverage rate in the 99 clonal LOH segments and 119 subclonal LOH segments in the three types of cells validate that the three categories of nuclei agree with our model (Fig. 2c). For example, cell 34 has no LOH events in the clonal LOH region or in the subclonal LOH region, and is therefore a normal cell. Cell 26 displays LOH events in the clonal LOH region, but not in the subclonal LOH region, and is therefore identified as belonging to the parent tumor clone. Cell 1 has LOH events in both the clonal and subclonal LOH regions, and is therefore identified as belonging to the child tumor clone.
The histograms of the coverage rates from the aggregated 57 parent tumor cells show patterns similar to those of cell 26, whereas the histograms from the aggregated 20 child tumor cells show patterns similar to those of cell 1 (Supplementary Fig. 3). Therefore, our single cell experiment successfully validates the subclones identified by THEMIS.
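The classification logic described above can be illustrated with a small Python sketch. This is not the classifier of Supplementary Note 2, whose details are not reproduced here. It assumes, hypothetically, that each cell is summarized by two numbers (coverage in clonal 1-copy regions and in subclonal 1-copy regions, each divided by coverage in clonal 2-copy regions), that these ratios are Gaussian around 1.0 for two copies and 0.5 for one copy, and that the three classes have equal prior probability. The noise level `sd` is an arbitrary illustrative value.

```python
import math

# Expected coverage relative to clonal 2-copy regions, per class:
# (clonal 1-copy regions, subclonal 1-copy regions).
EXPECTED_RATIOS = {
    "normal": (1.0, 1.0),  # no LOH anywhere
    "parent": (0.5, 1.0),  # LOH in clonal regions only
    "child": (0.5, 0.5),   # LOH in clonal and subclonal regions
}


def classify_cell(clonal_ratio, subclonal_ratio, sd=0.15, priors=None):
    """Return (most probable class, posterior probabilities) for one cell.

    Assumptions (hypothetical): independent Gaussian noise with standard
    deviation `sd` on both ratios, and uniform priors unless provided.
    """
    if priors is None:
        priors = {label: 1.0 / len(EXPECTED_RATIOS) for label in EXPECTED_RATIOS}
    log_posterior = {}
    for label, (mean_clonal, mean_subclonal) in EXPECTED_RATIOS.items():
        squared_error = ((clonal_ratio - mean_clonal) ** 2
                         + (subclonal_ratio - mean_subclonal) ** 2)
        log_posterior[label] = math.log(priors[label]) - squared_error / (2 * sd ** 2)
    # Normalize in log space for numerical stability.
    max_log = max(log_posterior.values())
    unnormalized = {k: math.exp(v - max_log) for k, v in log_posterior.items()}
    total = sum(unnormalized.values())
    posterior = {k: v / total for k, v in unnormalized.items()}
    return max(posterior, key=posterior.get), posterior


label_a, _ = classify_cell(1.00, 0.98)  # coverage unchanged in both region types
label_b, _ = classify_cell(0.52, 1.05)  # halved in clonal regions only
label_c, _ = classify_cell(0.48, 0.50)  # halved in both region types
```

Under these assumptions the three example cells are assigned to the normal, parent and child classes, respectively, mirroring the qualitative reasoning applied to cells 34, 26 and 1.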
Joint analysis over multiple biopsies from the ITOMIC study
The Intensive Trial of OMics In Cancer (ITOMIC001) enrolls patients with metastatic triple negative breast cancer in whom biopsies of multiple metastatic sites are performed repeatedly over time^{10}. For each patient, multiple biopsies are evaluated using next generation sequencing. We performed joint heterogeneity analysis on three biopsies from Patient 1 (Fig. 3a). Patient 1 was originally diagnosed with triple-negative breast cancer in February 2011, and enrolled in ITOMIC001 in October 2013. The patient then received three different treatments, including cisplatin (between study days 12 and 125), an investigational PARP inhibitor veliparib (between study days 126 and 194), and the kinase inhibitor ponatinib (from study day 195 until the time of her death on study day 250). The first biopsy B1 was sampled from an involved right axillary lymph node, collected on study day 7 (before cisplatin). The second biopsy B2 was sampled from the same right axillary lymph node on study day 125 (after cisplatin). The third sample B3 was from a left peribronchial lymph node, collected at autopsy following the patient's death on study day 250 (post ponatinib).
We followed a five-step procedure to recover the phylogenetic tree from the three tumor biopsies (see Methods for details), which involves enumerating all possible candidate phylogenies from individual biopsy analysis and then selecting the best phylogeny by likelihood (Supplementary Note 3). The recovered phylogenetic tree (Fig. 3b) reveals the relationships among the recovered clones and the mutations accumulated at each stage of cancer progression. Parent clone A is shared by all three biopsies, and child clones (B, C, and D) inherit the mutations present in parent clone A and acquire new mutations of their own. Although clones B, C and D occur in separate biopsies, clones C and D appear to be descendants of clone B, meaning that the new mutations detected in later biopsies occurred within the child tumor clones. In addition to clones A, B, C and D, the phylogenetic tree includes an internal node for an inferred intermediate clone (CD) hosting the mutations shared between clone C and clone D, and corresponding to a splitting point between clones C and D in cancer progression. On the phylogenetic tree, we also label the new CNAs and SNVs on the edges. We visualize these mutations on a genome-wide plot, and provide the number of germline heterozygous sites affected by these CNAs and the number of the SNVs (Fig. 3b). The mutations on the edges of the phylogenetic tree reveal the mutation accumulation history of this patient and can help in tracking mutations related to treatment resistance. By the time the patient joined the study, there had been two major phases of mutation accumulation, one corresponding to the mutations accumulated in clone A and the other corresponding to mutations accumulated in clone B. Comparing the two phases, more SNVs emerged in the second phase. After joining the study and receiving further treatments, additional mutations emerged.
The mutations on the edges between clone B and clone C emerged during treatment with cisplatin, whereas those on the edge between clone CD and clone D emerged during treatment with the PARP inhibitor veliparib (without response) followed by treatment with the kinase inhibitor ponatinib (which did yield a partial response). Because veliparib failed to affect tumor growth, we attribute the changes associated with the CD → D transition to ponatinib, which was given based on the presence of two activating mutations affecting FGFR2 (S252W;Y375C) (manuscript submitted).
A number of intriguing patterns were revealed when we looked at the mutations associated with different phases of treatment. First, many of the genes in CNA regions are known to be related to cancer. Within the CNAs associated with the three treatments (147 CNAs during treatment with cisplatin and 98 CNAs during treatment with veliparib and ponatinib), we identified 848 genes and 519 genes, respectively. We queried these genes on the NCBI gene website (www.ncbi.nlm.nih.gov/gene) and retrieved 186 genes and 175 genes, respectively, that are known to be related to cancer in the literature. The retrieved genes are related to many important cancer signaling pathways (Fig. 3c). On some of the cancer signaling pathways (i.e., MEK/MAPK/Erk, PI3K/Akt/mTor, NFκB, and p53), core genes (i.e., MAP2K7, PIK3C2A, PIK3CD, PIK3CB, TNFSF9, TNFSF14, NRAS, GSK3B, Notch3, TP53TG5 and SNAI1) experienced copy-number changes during different phases of treatment. Second, the genes mutated during different phases of treatment show patterns that are potentially illustrative of different therapeutic responses. The proportion of genes experiencing copy number gains in the later stage (i.e., on the edge CD → D, during the treatments with veliparib and ponatinib) is much higher than that in the earlier stage (on the edges B → CD and CD → C, during the treatment with cisplatin), and the difference in proportion is more dramatic in important cancer signaling pathways, including MEK/MAPK/Erk, PI3K/Akt/mTor and NFκB. The genes mutated in the different phases of treatment also showed different functional focuses. According to DAVID^{11,12}, the top function clusters among the genes mutated in the earlier stage are rho GTPase activation, growth factor, DNA damage and ErbB signaling. The top function clusters among the genes mutated in the later stage are DNA damage, ras signaling, nucleotide-binding, zinc-finger, neurotrophin signaling, and endocytosis.
Third, a number of SNVs occurred on or near genes known to be related to cancer signaling pathways, which allows us to investigate them together with the genes affected by copy number changes. During treatment with cisplatin, one SNV occurred near an intron/exon boundary within BIN1, which is known to interact with the myc oncoprotein as a putative tumor suppressor. During the treatments with veliparib and ponatinib, SNVs occurred on or near LATS1 (a core component of the Hippo-YAP pathway), MGMT (related to DNA damage), IL17RB (related to NFκB signaling) and APCDD1L (related to Wnt signaling), all of which are known to be related to breast cancer. Although additional experiments are needed for further validation, THEMIS provides a powerful computational tool to generate hypotheses from multiple biopsy DNA sequencing data.
Discussion
THEMIS offers a powerful and extensible modeling framework to jointly capture different types of genomic aberrations in the analysis of multiple biopsies. The integration of CNAs and SNVs in the heterogeneity analysis increases the accuracy of clonal inference relative to previous methods that consider only single types of mutations. For example, if we observe an allelic ratio of 0.3 at a somatic mutation site, then the cell prevalence of the SNV should be 60% if the corresponding genotype is AB, but the prevalence should be 85.7% if the genotype is AAB. In such cases, methods that fail to jointly consider copy number information and SNVs can be misled. In addition, the integration of multiple types of mutations allows us to understand cancer comprehensively, and to address important questions such as how the different types of mutations cooperate with each other and what roles they play at different stages of cancer progression.
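The arithmetic behind this example can be checked by inverting the expected allelic ratio given in Eq. (2) of Methods. The Python sketch below assumes a somatic mutation site (no alternative allele in normal cells), a normal copy number of 2, and a single mutated copy in the tumor genotype; the function name is hypothetical.

```python
def prevalence_from_allelic_ratio(allelic_ratio, n_alt_tumor, n_tumor, n_normal=2):
    """Solve a = n_alt * p / (n_tumor * p + n_normal * (1 - p)) for p.

    Assumption: somatic site, so normal cells carry zero copies of the
    alternative allele. Rearranging gives
        p = a * n_normal / (n_alt - a * (n_tumor - n_normal)).
    """
    denominator = n_alt_tumor - allelic_ratio * (n_tumor - n_normal)
    if denominator <= 0:
        raise ValueError("allelic ratio is not attainable for this genotype")
    return allelic_ratio * n_normal / denominator


# Genotype AB: two copies, one of them mutated.
prevalence_ab = prevalence_from_allelic_ratio(0.3, n_alt_tumor=1, n_tumor=2)
# Genotype AAB: three copies, one of them mutated.
prevalence_aab = prevalence_from_allelic_ratio(0.3, n_alt_tumor=1, n_tumor=3)
```

With an observed ratio of 0.3, the AB case gives 0.3 × 2 / 1 = 0.60, while the AAB case gives 0.6 / 0.7 ≈ 0.857, so ignoring the copy number gain would underestimate the prevalence by more than 25 percentage points.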
The joint analysis over multiple biopsies from the same patient provides a complete picture of mutation progression in the patient, which may shed light on how tumor cells escape treatment and metastasize. The ability to analyze multiple biopsies jointly will be increasingly important as DNA sequencing costs continue to decrease. The current turnaround time of analyzing the three biopsies with our model, including both data preprocessing and model running, is just a few hours.
Because THEMIS is built using a general purpose graphical models toolkit, the approach is easy to extend to alternative model architectures. For example, during review of this manuscript, one reviewer suggested that the THEMIS model likely over-segments the genome. We verified this effect empirically and then demonstrated the flexibility of the THEMIS framework by modifying the model to incorporate a user-specified constraint on the number of segments (Supplementary Note 4). In addition, GMTK provides flexible calculation in both estimation and inference, including both exact and approximate inference algorithms. Based on the available computing resources, the user can easily trade memory for running time. Using this modeling and algorithmic flexibility, we plan in future work to extend THEMIS to account for more complex types of mutations, such as chromothripsis and chromoplexy. We also plan to incorporate a more principled phylogeny reconstruction method into THEMIS. Ultimately, THEMIS will provide a testbed for model development by us and others interested in modeling the full complexity of tumor evolution.
Data availability
The bulk DNA sequencing data and the single cell DNA sequencing data used in our analysis can be downloaded from the Sequence Read Archive under accession SRP102304.
Software availability
THEMIS is available at https://github.com/jieliu6/THEMIS.
Methods
Data preprocessing
We assume that the next generation sequencing data have been mapped to the reference genome, and that the mapped BAM files are ready for analysis. Preprocessing of the data consists of three steps (Supplementary Fig. 4). First, we identify the genomic sites that will be included in the model. Our model captures both CNA events and SNV events; therefore, two types of genomic sites are included. For CNAs, we consider germline heterozygous sites, since at these sites we can monitor not only absolute copy number changes (via the tumor-normal read depth difference), but also what happens to the two individual copies (via allelic imbalance). For SNVs, we consider the somatic mutation sites which host an SNV event in any of the tumor biopsies. From the germline (normal) BAM file, we use Samtools to identify germline heterozygous sites. From the tumor and normal BAM files, we use MuTect^{13} to identify somatic mutation sites. Second, we filter out unreliable sites and reads using MuTect^{13}. Third, we adjust for GC content and mappability. Short reads from next generation sequencers are not uniformly distributed across the genome: more reads are expected to be obtained from regions with higher GC content and mappability. This bias cannot be fully removed by normalizing against another next generation sequencing library (e.g. from a normal biopsy) from the same patient^{14}. We therefore use HMMcopy^{15} to adjust the read counts for GC content and mappability.
Modeling choices in THEMIS
Unlike previous methods such as phyloWGS^{5}, SPRUCE^{6} and Canopy^{7}, which capture CNA or SNV events as the entities in the model, our model THEMIS and its predecessor TITAN directly model individual genomic positions as the entities in the model and therefore have the ability to perform CNA calling during tumor heterogeneity analysis. Both THEMIS and TITAN are dynamic graphical models with each frame representing a single genomic position, with CNA events captured by hidden Markov chains. Therefore, THEMIS inherits five key assumptions from TITAN:

1. Two primary observed variables (allelic imbalance and the tumor-normal read depth ratio) reflect the underlying somatic genotype of the tumor at germline heterozygous sites.

2. CNA events span multiple contiguous germline heterozygous sites.

3. The observed NGS data comes from heterogeneous cellular populations, including normal cells and tumor subpopulations.

4. Two mutation events are observed at the same cellular prevalence if and only if the two events come from the same subpopulation.

5. Only one CNA event can arise in only one tumor subpopulation at each genomic position.
Note that Assumption 4, although used by many tumor heterogeneity models, can be invalid if two different tumor subclones in a tumor have the same cellular prevalence. The purpose of introducing Assumption 5 is to make the heterogeneity model simple and identifiable; however, this assumption does prevent us from modeling more complicated situations in which multiple CNAs arise in the same genomic region.
We usually have around 30–50 thousand germline heterozygous sites and several hundred somatic mutation sites in whole-exome sequencing data from a single biopsy. With reasonable sequencing depth (greater than ∼100 reads per position, on average) the underlying genotypes (i.e. the type of the CNA event) estimated from the contiguous germline heterozygous sites can be inferred accurately. Integrating the somatic mutation sites and germline heterozygous sites using two factorial Markov chains allows us to model sites that harbor both a CNA event and a somatic mutation. In the situation when the observed variables at one somatic mutation site suggest that the genotype or the subclone assignment at that site disagrees with the neighboring germline heterozygous sites, THEMIS can still infer the correct hidden genotype and subclone assignment based on the observed variables at the somatic mutation site. Furthermore, because there will typically be many contiguous germline heterozygous sites before and after this somatic mutation site, the disagreement will not be propagated to nearby germline heterozygous sites.
We adopted these particular modeling choices and assumptions based on the sequencing quality and depth in our data. However, we encourage users to adjust these modeling choices and assumptions as appropriate for their own data. The extensible modeling platform employed by THEMIS should make it easy to implement variants of the model proposed here.
Structure of the THEMIS model
The THEMIS dynamic graphical model can be represented using a standard “plate” representation. In Fig. 4, the “Prologue” represents the start of the model, and the “Chunk” represents all random variables associated with a single genomic position. In practice, the chunk is copied multiple times so that the length of the model matches the length of the observed data (i.e., the total number of genomic positions). In the figure, each vertex represents one random variable at a particular genomic position. If a vertex is shaded, then the corresponding random variable is observed; otherwise, the random variable is hidden. The variables and parameters used in the model are explained in the next two sections and summarized in Supplementary Table 5. We use capital letters to denote random variables and the corresponding lowercase letters to denote the particular values of the random variables. If a lower case letter has a bar on top, then it is observed; otherwise, it is inferred.
The backbone of the model consists of two Markov chains, i.e., two hidden variables at each site t, corresponding to the unknown genotype of the mutation (G _{ t }) and an indicator (Z _{ t }) of which clone the mutation occurs in. The two Markov chains capture the phenomenon that CNA events span multiple contiguous genomic positions, and the corresponding most probable states that we infer from the observed variables are essentially the output of our THEMIS model. At site t and in biopsy m, a set of observed variables represent the allelic ratio (A _{ m,t }) and the log ratio between tumor read depth and normal read depth (L _{ m,t }). Other useful information about the genomic position is also captured in the model as observed variables, including the type of site (D _{ t }), an indicator variable for the first site of a chromosome (S _{ t }), and the distance from its previous site (H _{ t }). For each biopsy m, the model contains an additional set of Z hidden variables, \({P}_{m}^{1},{P}_{m}^{2},\ldots ,{P}_{m}^{Z}\), denoting the prevalence levels of the clones.
The conditional independence relationships among the variables are encoded by the edges either within a genomic position or between two adjacent genomic positions. At each site, the model specifies the probability of the observed variables given the hidden variables, which captures how different genotypes and the occurrence in different clones, in combination, change the allelic ratio and log ratio in tumor biopsies. At a germline heterozygous site (where D _{ t } = 0), the allelic ratio is informative through its deviation from 0.5, the value expected in a normal cell. At a somatic mutation site (where D _{ t } = 1), the allelic ratio is informative through its deviation from 0, the value expected in a normal cell. Therefore, the parents of the allelic ratio (A _{ m,t }) include the genotype (G _{ t }), the clone index (Z _{ t }), the prevalence levels (\({P}_{m}^{1},{P}_{m}^{2},\ldots ,{P}_{m}^{Z}\)) and the type of site (D _{ t }). Because the log ratio of the tumor-normal read depth difference (L _{ m,t }) does not depend on the type of site, its parents include the genotype (G _{ t }), the clone index (Z _{ t }), and the prevalence levels (\({P}_{m}^{1},{P}_{m}^{2},\ldots ,{P}_{m}^{Z}\)). Between any two adjacent sites, we specify the transition probability between genotypes and the transition probability between clones. The variable H _{ t } is the distance (in base pairs) between site t and its previous site t − 1. We set transition parameters (for both G _{ t } and Z _{ t }) as functions of \({\bar{h}}_{t}\) (the observed value of H _{ t }) to capture the phenomenon that the chance that G _{ t } and G _{ t − 1} agree decreases as \({\bar{h}}_{t}\) increases (and similarly for Z _{ t } and Z _{ t − 1}). Therefore, H _{ t } is a parent of G _{ t } and Z _{ t } at any site t that is not the start of a chromosome.
When the current site is the start of a chromosome (S _{ t } = 1), G _{ t } and Z _{ t } do not depend on G _{ t − 1} and Z _{ t − 1}, but follow prior distributions (π _{ G } and π _{ Z }). Therefore, S _{ t } is also a parent of G _{ t } and Z _{ t }.
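The following Python sketch illustrates one way a transition matrix could depend on the distance to the previous site and fall back to a prior at the start of a chromosome. The exponential form and the `length_scale` parameter are assumptions chosen for illustration only; the text above does not specify the functional form used in THEMIS, where the transition parameters are estimated from data.

```python
import math


def transition_matrix(n_states, distance_bp, length_scale=1e6):
    """Hypothetical distance-dependent transition matrix.

    With probability exp(-distance / length_scale) the chain keeps its
    state; otherwise it redraws a state uniformly. The probability that
    adjacent sites agree therefore decreases as the distance grows.
    """
    keep = math.exp(-distance_bp / length_scale)
    redraw = (1.0 - keep) / n_states
    return [
        [redraw + (keep if i == j else 0.0) for j in range(n_states)]
        for i in range(n_states)
    ]


def next_state_distribution(previous, distance_bp, is_chromosome_start, prior,
                            length_scale=1e6):
    """Distribution over the hidden state at site t.

    At the start of a chromosome (S_t = 1) the previous site is ignored
    and the prior is returned unchanged.
    """
    if is_chromosome_start:
        return list(prior)
    n_states = len(previous)
    matrix = transition_matrix(n_states, distance_bp, length_scale)
    return [
        sum(previous[i] * matrix[i][j] for i in range(n_states))
        for j in range(n_states)
    ]


near = transition_matrix(3, distance_bp=1e3)
far = transition_matrix(3, distance_bp=5e6)
```

Under this assumed form, the diagonal entries of `near` are close to 1 and those of `far` are close to 1/3, so two sites that are far apart carry little information about each other's hidden state.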
Variables in the THEMIS model
The variables in the THEMIS model are either observed variables or hidden variables. The observed variables are directly obtained from the data, whereas the most probable states of the hidden variables must be inferred, given the observed variables and the trained parameters. Each frame of the THEMIS model contains five observed variables. Two of these are key signals to detect mutations, and they are modeled as Gaussians:

The allelic ratio A _{ m,t } at site t in biopsy m (∀m = 1, …, M) is modeled as a Gaussian variable, i.e.,
$$P({A}_{m,t}=a\mid {Z}_{t}=z,{G}_{t}=g,{P}_{m}^{z}={p}_{m}^{z},{D}_{t}=\bar{d})={\mathscr{N}}(a;{\mu }_{A,m,g,z,{p}_{m}^{z}},{\sigma }_{A,m,\bar{d}}^{2}).\tag{1}$$
The mean parameter \({\mu }_{A,m,g,z,{p}_{m}^{z}}\) associated with this Gaussian is set to
$${\mu }_{A,m,g,z,{p}_{m}^{z}}=\frac{{n}_{g}^{alt}{p}_{m}^{z}+{n}_{N}^{alt}(1-{p}_{m}^{z})}{{n}_{g}{p}_{m}^{z}+{n}_{N}(1-{p}_{m}^{z})},\tag{2}$$
where n _{ g } is the DNA copy number in tumor cells with genotype g, n _{ N } is the DNA copy number in normal cells, \({n}_{g}^{alt}\) is the copy number of the alternative allele in tumor cells with genotype g, and \({n}_{N}^{alt}\) is the copy number of the alternative allele in normal cells (Supplementary Table 6). The mean parameter \({\mu }_{A,m,g,z,{p}_{m}^{z}}\) is not estimated from data, but determined by the states of the hidden variables G _{ t }, Z _{ t }, and \({P}_{m}^{{Z}_{t}}\) and the observed \({\bar{d}}_{t}\). The variance parameter \({\sigma }_{A,m,\bar{d}}^{2}\), however, is estimated from data.

The log ratio of tumor-normal read depth at site t in biopsy m (∀m = 1, …, M), denoted by L _{ m,t }, is modeled as a Gaussian variable, i.e.,
$$P({L}_{m,t}=l\mid {Z}_{t}=z,{G}_{t}=g,{P}_{m}^{z}={p}_{m}^{z})={\mathscr{N}}(l;{\mu }_{L,m,g,z,{p}_{m}^{z}},{\sigma }_{L,m}^{2}).\tag{3}$$
The mean parameter \({\mu }_{L,m,g,z,{p}_{m}^{z}}\) is set to
$${\mu }_{L,m,g,z,{p}_{m}^{z}}={\mathrm{log}}_{2}\frac{{n}_{g}{p}_{m}^{z}+{n}_{N}(1-{p}_{m}^{z})}{{n}_{N}}+{c}_{m},\tag{4}$$
where n _{ N } is the copy number in normal cells (set to 2 by default), and n _{ g } is the DNA copy number in tumor cells with genotype g. The parameter c _{ m } captures the sequencing depth difference between the tumor biopsy and the normal biopsy and the read number discrepancy due to ploidy change in the tumor biopsy. Therefore, the mean parameter \({\mu }_{L,m,g,z,{p}_{m}^{z}}\) is also not estimated from data, but determined by the states of the hidden variables G _{ t }, Z _{ t }, and \({P}_{m}^{{Z}_{t}}\). The variance parameter \({\sigma }_{L,m}^{2}\) again is estimated from data.
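The two mean parameters can be evaluated directly from the copy numbers and the clone prevalence. The Python sketch below transcribes the expected allelic ratio and the expected log ratio defined above; the function and argument names are hypothetical, and the copy-number values in the example (one alternative copy in normal cells at a germline heterozygous site, none at a somatic site) are assumptions consistent with the expected normal-cell ratios of 0.5 and 0 rather than values taken from Supplementary Table 6.

```python
import math


def mean_allelic_ratio(n_alt_tumor, n_tumor, prevalence, n_alt_normal, n_normal=2):
    """Expected allelic ratio: alternative-allele copies over all copies,
    averaged over tumor cells (weight p) and normal cells (weight 1 - p)."""
    numerator = n_alt_tumor * prevalence + n_alt_normal * (1 - prevalence)
    denominator = n_tumor * prevalence + n_normal * (1 - prevalence)
    if denominator == 0:
        raise ValueError("no DNA copies present at this site")
    return numerator / denominator


def mean_log_ratio(n_tumor, prevalence, n_normal=2, depth_offset=0.0):
    """Expected log2 tumor/normal depth ratio plus the offset c_m."""
    average_copies = n_tumor * prevalence + n_normal * (1 - prevalence)
    if average_copies <= 0:
        raise ValueError("average copy number must be positive")
    return math.log2(average_copies / n_normal) + depth_offset


# Germline heterozygous site, one-copy loss of the alternative allele
# in a clone present in 50% of cells.
het_loss_ratio = mean_allelic_ratio(0, 1, 0.5, n_alt_normal=1)
het_loss_log = mean_log_ratio(1, 0.5)
# Somatic site, genotype AB with one mutated copy, clone prevalence 60%.
somatic_ratio = mean_allelic_ratio(1, 2, 0.6, n_alt_normal=0)
```

In the first example the expected allelic ratio drops from 0.5 to 1/3 and the expected log ratio is log2(0.75) ≈ −0.415; in the second, the expected allelic ratio is 0.3 and a copy-neutral genotype leaves the log ratio at the offset c_m.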
The remaining three observed variables provide information about the current genomic position, and they are discrete:

The variable H _{ t } is the distance (in base pairs) between site t and site t − 1. Although there is an equal number of graphical model frames between any two sites, the actual distance between them still affects the statistics of the underlying Markov chains, since both G _{ t } and Z _{ t } depend directly on H _{ t }. THEMIS therefore expresses a kind of irregularly spaced dynamic graphical model within the framework of a regularly spaced dynamic graphical model. Moreover, THEMIS does this more efficiently than an alternative in which the number of graphical model frames between sites t and t − 1 is proportional to \({\bar{h}}_{t}\), an approach that would be significantly more costly computationally. Supplementary Fig. 5 shows a histogram of \(\mathrm{log}({\bar{h}}_{t})\), indicating that the distances between successive sites are diverse; this diversity suggests that inter-site distance can have a significant influence on the Markov chains’ transition matrices.

The site type D _{ t } at site t is a Boolean, with D _{ t } = 0 denoting a germline heterozygous site and D _{ t } = 1 denoting a somatic mutation site that hosts an SNV event.

The Boolean variable S _{ t } denotes the start of a chromosome. When the current site is the start of a chromosome (S _{ t } = 1), G _{ t } and Z _{ t } do not depend on G _{ t − 1} and Z _{ t − 1}, but follow uniform prior distributions.
In addition, each frame of the THEMIS model contains two hidden variables.

The genotype G _{ t } at site t is a discrete variable, which corresponds to all possible genotypes up to a certain number of copies. We consider all possible genotypes up to five copies (Supplementary Table 6).

The clone index variable Z _{ t } at site t is a discrete variable of Z possible values (i.e., Z _{ t } ∈ {1, …, Z}), where Z is prespecified by the user.
Finally, the model contains a set of hidden variables that are “tied” across frames. For clone z in biopsy m (∀m = 1, …, M), the prevalence level variable \({P}_{m}^{z}\) is a discrete variable of P possible values, where P is prespecified by the user. The default P is 20, corresponding to 20 equally spaced prevalence levels between 0 and 1 (i.e. 0.05, 0.10, …, and 1.00). THEMIS does not model clone prevalence as a continuous variable because clone prevalence is a parent of other variables (e.g. allelic ratio), and a hidden continuous variable cannot appear as a parent of other variables in a dynamic graphical model.
Parameters in the THEMIS model
Some parameters in the THEMIS model need to be specified by the user, whereas other parameters are estimated from data. Specifically, the user must specify the following parameters: the number of biopsies used in the analysis (M), the number of subclones (Z), the maximum copy number in the mutations (\({c}_{max}^{T}\)), the number of prevalence levels (P), and the log-ratio offset in biopsy m due to ploidy and sequencing depth change (c _{ m }, ∀m = 1, …, M). In our experiments, we first run THEMIS with an initial estimate of c _{ m } derived by examining the bivariate plot of allelic ratio and log ratio at germline heterozygous sites (Supplementary Fig. 6). Specifically, c _{ m } is estimated by identifying the center of the normal genotype cluster on the log ratio axis. After running THEMIS, we re-estimate c _{ m } as the average log ratio over the sites whose genotypes are predicted to be “AB” (i.e. no CNA). This new estimate is used in a second run of THEMIS. In practice, the user can also leverage other ploidy estimation tools to get the initial estimate of c _{ m }, or run THEMIS multiple times with multiple initial estimates and choose the one with the highest likelihood. Three sets of parameters are estimated from data via the standard expectation-maximization (EM) algorithm for dynamic graphical models:

1. The variance of allelic ratio in biopsy m (∀m = 1, …, M) on germline heterozygous sites, denoted by \({\sigma }_{A,m,0}^{2}\), the variance of allelic ratio in biopsy m (∀m = 1, …, M) on somatic mutation sites, denoted by \({\sigma }_{A,m,1}^{2}\), and the variance of log ratio in biopsy m (∀m = 1, …, M) on each site, denoted by \({\sigma }_{L,m}^{2}\).

2. The transition probability from genotype j at site t − 1 to genotype i at site t (i, j ∈ {1, …, G}), denoted by \({Q}_{G}(i,j;{\bar{h}}_{t})\), and the transition probability from clone j at site t − 1 to clone i at site t (i, j ∈ {1, …, Z}), denoted by \({Q}_{Z}(i,j;{\bar{h}}_{t})\). We model \({Q}_{G}(i,j;{\bar{h}}_{t})\) and \({Q}_{Z}(i,j;{\bar{h}}_{t})\) as parametric functions of \({\bar{h}}_{t}\), the distance (in base pairs) between site t and site t − 1. We first define the probability of staying at the same genotype j, denoted by \({\rho }_{G}(j;{\bar{h}}_{t})\), as
$${\rho }_{G}(j;{\bar{h}}_{t})=\frac{{\mathscr{N}}(\mathrm{log}\,{\bar{h}}_{t};0,{\sigma }_{G,j}^{2})}{{\mathscr{N}}(0;0,{\sigma }_{G,j}^{2})}\cdot \frac{G-1}{G}+\frac{1}{G},$$ (5)
where \({\mathscr{N}}(x;0,{\sigma }_{G,j}^{2})\) is the probability density of a Gaussian distribution with mean 0 and variance \({\sigma }_{G,j}^{2}\) at the point x. Then
$${Q}_{G}(i,j;{\bar{h}}_{t})=\left\{\begin{array}{ll}{\rho }_{G}(j;{\bar{h}}_{t}), & \mathrm{if}\ i=j\\ \frac{1-{\rho }_{G}(j;{\bar{h}}_{t})}{G-1}, & \mathrm{otherwise}.\end{array}\right.$$ (6)
Similarly, we define
$${\rho }_{Z}(j;{\bar{h}}_{t})=\frac{{\mathscr{N}}(\mathrm{log}\,{\bar{h}}_{t};0,{\sigma }_{Z,j}^{2})}{{\mathscr{N}}(0;0,{\sigma }_{Z,j}^{2})}\cdot \frac{Z-1}{Z}+\frac{1}{Z},$$ (7)
and
$${Q}_{Z}(i,j;{\bar{h}}_{t})=\left\{\begin{array}{ll}{\rho }_{Z}(j;{\bar{h}}_{t}), & \mathrm{if}\ i=j\\ \frac{1-{\rho }_{Z}(j;{\bar{h}}_{t})}{Z-1}, & \mathrm{otherwise}.\end{array}\right.$$ (8)
Note that we estimate \({\sigma }_{G,j}^{2}\) (∀j = 1, …, G) and \({\sigma }_{Z,j}^{2}\) (∀j = 1, …, Z) from data in a maximum likelihood fashion. This parameterization captures the phenomenon that the probability of staying at the same genotype decreases monotonically with the distance from the previous site, falling from 1 toward the uniform value 1/G, and the rate of decrease is governed by \({\sigma }_{G,j}^{2}\). We therefore estimate the rate of decrease (i.e. \({\sigma }_{G,j}^{2}\)) adaptively from the data (unlike the user-prespecified transition probabilities in TITAN), and the rate differs across genotypes, since different mutation events may span different lengths on the genome (unlike the tied transition probabilities in TITAN).

3. The prior distributions of genotypes, clones and the cell prevalence levels of the clones (denoted by π _{ G }, π _{ Z } and π _{ P }, respectively). These prior distributions are responsible for the frames that correspond to the starts of the chromosomes. They are initialized as uniform distributions and trained from the data along with the other parameters.
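The distance-dependent transition probabilities of Eqs. (5)–(8) can be sketched in a few lines. This is an illustrative sketch rather than the GMTK-based implementation: the function names are hypothetical, the logarithm is assumed to be the natural log, and states are indexed from 0 for convenience.

```python
import math


def gaussian_pdf(x, variance):
    """Density of a zero-mean Gaussian with the given variance at point x."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def stay_probability(distance_bp, variance, n_states):
    """Eqs. (5)/(7): probability of keeping the same state across a gap.

    distance_bp : distance to the previous site in base pairs (>= 1)
    variance    : per-state variance (sigma^2_{G,j} or sigma^2_{Z,j})
    n_states    : number of states (G or Z), at least 2
    """
    ratio = gaussian_pdf(math.log(distance_bp), variance) / gaussian_pdf(0.0, variance)
    return ratio * (n_states - 1) / n_states + 1.0 / n_states


def transition_row(j, distance_bp, variance, n_states):
    """Eqs. (6)/(8): probabilities of moving from state j to every state i."""
    rho = stay_probability(distance_bp, variance, n_states)
    leave = (1.0 - rho) / (n_states - 1)  # mass shared equally by other states
    return [rho if i == j else leave for i in range(n_states)]


# Adjacent sites keep their state with probability close to 1; distant sites
# approach the uniform distribution over states.
near = stay_probability(distance_bp=1, variance=4.0, n_states=5)
far = stay_probability(distance_bp=10**6, variance=4.0, n_states=5)
row = transition_row(j=0, distance_bp=1000, variance=4.0, n_states=5)
```

A larger variance makes the stay probability decay more slowly with distance, which is how a per-genotype variance can encode the typical length of the corresponding mutation event.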
We use Ω to denote the set of parameters in the joint distribution specified by the model, namely \(\Omega =\{{\sigma }_{A,m,0}^{2},{\sigma }_{A,m,1}^{2},{\sigma }_{L,m}^{2},{\sigma }_{G,j}^{2},{\sigma }_{Z,j}^{2},{\pi }_{G},{\pi }_{Z},{\pi }_{P},\forall m=1,\ldots ,M,\forall j=1,\ldots ,Z\}\). In the estimation step, we use the EM algorithm to estimate the parameters and find a (local) maximum of the likelihood of the observed data, denoted by \(\hat{\Omega }\).
In the inference step, we infer the most probable joint states of the hidden variables given the observed data and the estimated parameters \(\hat{\Omega }\).
The inferred most probable sequence of hidden variables, as the output of our algorithm, provides the heterogeneity analysis results, i.e., a number of subclones and their genotypes and cell prevalences (Fig. 1b).
Selecting the number of subclones
THEMIS requires the user to specify the number of subclones in the biopsies before running the model. There are three ways of identifying the number of subclones from the data. The first is a naive visualization method. If the biopsy is well sequenced, the number of subclones can be identified directly from the bivariate plot of the biopsy (allelic ratio against log ratio at germline heterozygous sites) by observing the different prevalence levels of the LOH events. We take one tumor biopsy (biopsy B1 in the ITOMIC study) as an example, whose bivariate plot is provided in Supplementary Fig. 6. There are two major LOH prevalence levels in the plot. Therefore, we can assume that there are two tumor subclones in the biopsy.
Another way of choosing the number of subclones is to use the Bayesian information criterion (BIC)^{16}. BIC is defined as
$$\mathrm{BIC}=k\,\mathrm{ln}\,n-2\,\mathrm{ln}\,L,$$
where ln L is the log likelihood of the data, k is the number of degrees of freedom, and n is the number of data points. We choose the subclone number that produces the smallest BIC score. When we run the 2-subclone model and the 3-subclone model on the tumor biopsy B1, the BIC scores are −476,725 and −464,710, respectively. Therefore, based on the BIC scores, we can assume there are two tumor subclones in the biopsy.
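The BIC-based choice can be sketched as follows, assuming the standard Schwarz form BIC = k ln n − 2 ln L. The helper names are hypothetical; the two scores in the example are the ones reported above for biopsy B1.

```python
import math


def bic(log_likelihood, n_params, n_points):
    """Standard BIC: k * ln(n) - 2 * ln(L). Smaller is better."""
    return n_params * math.log(n_points) - 2.0 * log_likelihood


def select_by_bic(bic_scores):
    """Return the subclone number with the smallest BIC.

    bic_scores maps a candidate subclone number to its BIC score.
    """
    return min(bic_scores, key=bic_scores.get)


# BIC scores reported in the text for the 2- and 3-subclone models on B1.
reported_scores = {2: -476725, 3: -464710}
chosen = select_by_bic(reported_scores)  # 2
```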
A third way of choosing the number of subclones is to use cross-validation. Suppose that we use three-fold cross-validation. We randomly partition the chromosomes into three sets. In each training-testing split, we use the data from two sets as the training data and the remaining set as the testing data. With the parameters estimated from the training data, we run the Viterbi algorithm on the testing data, and choose the subclone number that produces the largest averaged log-likelihood (a.k.a. the Viterbi score in GMTK) on the testing data. In both simulated data (in Simulations 1 and 2, we simulated two tumor subclones and three tumor subclones, respectively) and real data (biopsy B1 in the ITOMIC study), we observed that the three methods provide the correct results (Supplementary Table 7).
In practice, one may use any of the three methods, or a combination of them, to set the number of subclones. Note that although the naive visualization method is straightforward, it may produce inaccurate estimates if the sequencing depth of the biopsy is low or if the prevalence levels of two subclones are close to each other. Cross-validation is more robust than BIC, but incurs additional computational cost. When BIC or cross-validation is used, we recommend starting with a small number of subclones (e.g. 2) and increasing the number until the evaluation criterion deteriorates. For example, if a 2-subclone model produces a lower BIC (or a higher averaged ln L in cross-validation) than a 3-subclone model, it is not necessary to run the 4-subclone model.
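The chromosome-level cross-validation and the incremental search described above can be sketched as below. This is only an outline under stated assumptions: `score` is a hypothetical callback standing in for training THEMIS and computing the averaged held-out log-likelihood for a given subclone number, and the cap `max_subclones` is an illustrative choice, not a THEMIS setting.

```python
import random


def train_test_splits(chromosomes, n_folds=3, seed=0):
    """Randomly partition chromosomes into folds; yield (train, test) lists."""
    shuffled = list(chromosomes)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for k, test in enumerate(folds):
        train = [c for i, fold in enumerate(folds) if i != k for c in fold]
        yield train, test


def choose_subclone_number(score, start=2, max_subclones=6):
    """Increase the subclone number until the held-out score deteriorates.

    score(z) is a hypothetical callback returning the averaged held-out
    log-likelihood of a z-subclone model (higher is better).
    """
    best_z, best_score = start, score(start)
    for z in range(start + 1, max_subclones + 1):
        current = score(z)
        if current <= best_score:
            break  # no improvement: larger models need not be run
        best_z, best_score = z, current
    return best_z


chromosomes = [str(i) for i in range(1, 23)] + ["X"]
splits = list(train_test_splits(chromosomes))
```

Splitting by whole chromosomes keeps each Markov chain intact within a fold, since the chains restart at chromosome boundaries.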
Joint analysis of multiple biopsies from the same patient
The joint analysis of multiple biopsies in THEMIS is done by first enumerating candidate phylogenetic trees, encoding each tree in the conditional probability tables associated with variables A _{ m,t } and L _{ m,t }, and then selecting the tree whose associated model yields the highest likelihood. During the enumeration phase, we make three assumptions. First, we assume that we have the statistical power to discern all the clones from individual biopsies and estimate their prevalences. Second, we assume that we can identify shared clones between biopsies by computing and thresholding similarities between the clones. Third, if the sum of the prevalences (p _{ a } and p _{ b }) of clones a and b is greater than 1.0 in at least one biopsy, and p _{ a } > p _{ b } in all biopsies where clones a and b are present, then we assume that clone a is an ancestor of clone b. The first two assumptions ensure that the ground truth structure is contained in the candidate structures. The third assumption helps us reduce the number of candidate structures. Users are also encouraged to use other information, such as the time and physical locations of the biopsies, to eliminate candidate structures. Joint analysis over multiple biopsies can be carried out in the following five steps.

1. Analyze biopsies separately with THEMIS and identify the genotype and prevalence of each clone within each biopsy.

2. Compute similarities between all pairs of clones from different biopsies and merge similar clones.

3. Identify consistent parent-child relationships based on the individually estimated prevalences using the third assumption above.

4. Enumerate all phylogenies consistent with those relationships and run THEMIS accordingly.

5. Select the phylogeny with maximum likelihood.
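Step 3 relies on the third assumption above, which can be expressed as a simple predicate on the per-biopsy prevalences. The sketch below is one reading of that rule, not the authors' code: it assumes that both conditions are evaluated only over biopsies in which both clones are present, and the function names and example prevalences are hypothetical.

```python
def is_ancestor(prev_a, prev_b):
    """Return True if clone a is assumed to be an ancestor of clone b.

    prev_a, prev_b map a biopsy name to the clone's prevalence in that
    biopsy; biopsies where the clone is absent are simply omitted.
    """
    shared = set(prev_a) & set(prev_b)
    if not shared:
        return False
    # The prevalences sum to more than 1.0 in at least one shared biopsy ...
    sum_exceeds_one = any(prev_a[m] + prev_b[m] > 1.0 for m in shared)
    # ... and clone a is more prevalent than clone b in every shared biopsy.
    a_always_larger = all(prev_a[m] > prev_b[m] for m in shared)
    return sum_exceeds_one and a_always_larger


def ancestor_pairs(prevalences):
    """List all (ancestor, descendant) pairs implied by the rule."""
    return [
        (a, b)
        for a in prevalences
        for b in prevalences
        if a != b and is_ancestor(prevalences[a], prevalences[b])
    ]


# Hypothetical prevalences for three merged clones across two biopsies.
example = {
    "clone1": {"B1": 0.9, "B2": 0.8},
    "clone2": {"B1": 0.4, "B2": 0.1},
    "clone3": {"B2": 0.15},
}
pairs = ancestor_pairs(example)  # [("clone1", "clone2")]
```

Each pair found this way constrains the set of candidate phylogenies enumerated in step 4; pairs for which the rule is silent leave both orderings open.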
References
1. Shah, S. P. et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486, 395–399 (2012).
2. Nowell, P. C. The clonal evolution of tumor cell populations. Science 194, 23–28 (1976).
3. Gerlinger, M. et al. Cancer: Evolution within a lifetime. Annual Review of Genetics 48, 215–236 (2014).
4. Frampton, G. M. et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nature Biotechnology 31, 1023–1031 (2013).
5. Deshwar, A. G. et al. PhyloWGS: Reconstructing subclonal composition and evolution from whole genome sequencing of tumors. Genome Biology 16 (2015).
6. El-Kebir, M., Satas, G., Oesper, L. & Raphael, B. J. Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Systems 3, 43–53 (2016).
7. Jiang, Y., Qiu, Y., Minn, A. J. & Zhang, N. R. Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proceedings of the National Academy of Sciences 113, E5528–E5537 (2016).
8. Bilmes, J. & Zweig, G. The Graphical Models Toolkit: An open source software system for speech and time-series processing. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2002).
9. Ha, G. et al. TITAN: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome Research 24, 1881–1893 (2014).
10. Blau, C. A. et al. A distributed network for intensive longitudinal monitoring in metastatic triple-negative breast cancer. Journal of the National Comprehensive Cancer Network 14, 8–17 (2016).
11. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols 4, 44–57 (2009).
12. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research 37, 1–13 (2009).
13. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology 31, 213–219 (2013).
14. Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research 40, gks001 (2012).
15. Ha, G. et al. Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Research 22, 1995–2007 (2012).
16. Schwarz, G. Estimating the dimension of a model. Annals of Statistics 6, 461–464 (1978).
17. Yau, C. et al. A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biology 11, R92 (2010).
18. Letouzé, E., Allory, Y., Bollet, M. A., Radvanyi, F. & Guyon, F. Analysis of the copy number profiles of several tumor samples from the same patient reveals the successive steps in tumorigenesis. Genome Biology 11, 1–19 (2010).
19. Greenman, C. D. et al. Estimation of rearrangement phylogeny for cancer genomes. Genome Research 22, 346–361 (2012).
20. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology 30, 413–421 (2012).
21. Strino, F., Parisi, F., Micsinai, M. & Kluger, Y. TrAp: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Research 41, e165 (2013).
22. Oesper, L., Mahmoody, A. & Raphael, B. J. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biology 14, R80 (2013).
23. Oesper, L., Satas, G. & Raphael, B. J. Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data. Bioinformatics 30, 3532–3540 (2014).
24. Purdom, E. et al. Methods and challenges in timing chromosomal abnormalities within cancer samples. Bioinformatics 29, 3113–3120 (2013).
25. Yau, C. OncoSNP-SEQ: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes. Bioinformatics 29, 2482–2484 (2013).
26. Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nature Methods (2014).
27. Miller, C. A. et al. SciClone: Inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Computational Biology 10, e1003665 (2014).
28. Zare, H. et al. Inferring clonal composition from multiple sections of a breast cancer. PLoS Computational Biology 10, e1003703 (2014).
29. Fischer, A., Vázquez-García, I., Illingworth, C. J. & Mustonen, V. High-definition reconstruction of clonal composition in cancer. Cell Reports (2014).
30. Schwarz, R. F. et al. Phylogenetic quantification of intra-tumour heterogeneity. PLoS Computational Biology 10, e1003535 (2014).
31. Qiao, Y. et al. SubcloneSeeker: a computational framework for reconstructing tumor clone structure for cancer variant interpretation and prioritization. Genome Biology 15, 443 (2014).
32. Hajirasouliha, I., Mahmoody, A. & Raphael, B. J. A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Bioinformatics 30, i78–i86 (2014).
33. Fan, X., Zhou, W., Chong, Z., Nakhleh, L. & Chen, K. Towards accurate characterization of clonal heterogeneity based on structural variation. BMC Bioinformatics 15, 299 (2014).
34. Jiao, W., Vembu, S., Deshwar, A. G., Stein, L. & Morris, Q. Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics 15, 35 (2014).
35. Sengupta, S. et al. BayClone: Bayesian nonparametric inference of tumor subclones using NGS data. In Pacific Symposium on Biocomputing, vol. 20, 467 (World Scientific, 2015).
36. Malikic, S., McPherson, A. W., Donmez, N. & Sahinalp, C. S. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics 31, 1349–1356 (2015).
37. Popic, V. et al. Fast and scalable inference of multi-sample cancer lineages. Genome Biology 16, 1 (2015).
38. El-Kebir, M., Oesper, L., Acheson-Field, H. & Raphael, B. J. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics 31, i62–i70 (2015).
Acknowledgements
The authors would like to thank the patient and her family. We gratefully acknowledge the anonymous reviewers for their valuable feedback. We also gratefully acknowledge the support from the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery, the Moore/Sloan Data Science Environments Project at the University of Washington, the Amazon Research Credits program, and South Sound CARE.
Author information
Contributions
J.L., J.A.B. and W.S.N. conceived the method. J.L. and J.T.H. performed the analysis. J.L., W.S.N., and C.A.B. wrote the manuscript. E.M.M., C.S., M.O.D., and C.A.B. contributed experimental results. S.B., V.K.G., and C.A.B. contributed to the clinical trial. R.M.D., C.L., E.M.M., D.P., P.R., and J.S. acquired single cell sequencing data.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, J., Halloran, J.T., Bilmes, J.A. et al. Comprehensive statistical inference of the clonal structure of cancer from multiple biopsies. Sci Rep 7, 16943 (2017). https://doi.org/10.1038/s41598-017-16813-4