Cancer metastasis networks and the prediction of progression patterns

Background: Metastasis patterns in cancer vary both spatially and temporally. Network modelling may allow the incorporation of the temporal dimension in the analysis of these patterns. Methods: We used Medicare claims of 2 265 167 elderly patients aged ⩾65 years to study the large-scale clinical pattern of metastases. We introduce the concept of a cancer metastasis network, in which nodes represent the primary cancer site and the sites of subsequent metastases, connected by links that measure the strength of co-occurrence. Results: These cancer metastasis networks capture both temporal and subtle relational information, the dynamics of which differ between cancer types. Using these networks as entities on which the metastatic disease of individual patients may evolve, we show that they may be used, for certain cancer types, to make retrograde predictions of a primary cancer type given a sequence of metastases, as well as anterograde predictions of future sites of metastasis. Conclusion: Improvements over traditional techniques show that such a network-based modelling approach may be suitable for studying metastasis patterns.

To date, study of such organotropic dissemination patterns have relied primarily on autopsies. These have all been relatively small-scale studies (Abrams et al, 1950;Weiss et al, 1988;Weiss, 1992;Disibio and French, 2008). However, we may now study these patterns using computational methods on large data sets. Disease 'comorbidity' can be thought of and analysed as a network (Lee et al, 2008). Indeed, cancer contains many manifestations of networks at various levels of organisation, including the genetic (Tavazoie et al, 1999;Aggarwal et al, 2006), cellular (Vogelstein et al, 2000;Irish et al, 2004), and phenotypic (Hidalgo et al, 2009). We argue that a topographic network for metastases can be constructed as well; here, the nodes represent sites where metastases may arise and links represent the cooccurrence of such metastases. Such a network dynamically evolves as cancer progresses to more advanced stages. One may imagine a sequence of metastatic events in a patient as a trajectory on this dynamic cancer metastasis network. Such a network may yield further insights into the nature and patterns of metastatic dissemination.
Medicare data allow us to look at patterns of metastatic dissemination on a massive scale, across a broad range of cancer types and secondary sites. The ability to do this is aided not only by the sheer size of the data set, but also by the fact that the data are diagnosis driven. As opposed to data derived from autopsy studies, this provides the advantage of being more clinically relevant in terms of patient management -each diagnosed metastasis at a secondary site is recorded as a separate event. In addition, these data give us another dimension, that is, time. However, it is important to note that this data set restricts our patient population to those aged X65 years.
The aim of this study is not to compare cancers by pathological, molecular, or genetic characteristics, as most studies do, but rather to analyse progression dynamics by the anatomical site of origin. By analysing cancer metastasis using networks, we can derive, quantify, and compare the topographical patterns on a large scale. In addition, we can analyse the dynamics of these networks and their structural properties, using them as the basis for the development of betterperforming predictive algorithms. Using these networks as entities on which the metastatic disease of individual patients evolve, we hypothesise that we may make retrograde predictions of primary cancer types given a sequence of metastases and anterograde predictions of future sites of metastasis.

Clinical data
We used the so-called Medicare Provider Analysis and Review (MedPAR) records for 1990-1993, containing a comprehensive set of all the Medicare claims of 13 039 018 elderly patients aged X65 years, who were hospitalised during this 4-year period (we excluded the minority of Medicare beneficiaries o65 years). Such records are highly complete and accurate and have been used for epidemiological and other research (Fisher et al, 1990;Mitchell et al, 1994;Christakis and Allison, 2006); the coverage of the Medicare programme encompasses 35 million beneficiaries (Landon et al, 2004). For every hospital visit, up to 10 disease diagnoses are recorded in the International Classification of Diseases version 9 with Clinical Modification (ICD-9-CM) format. We extracted the subset of patients who had at least one diagnosis within the range of 140 -239, which represent neoplasms in the ICD-9 classification scheme. This subset contains 2 265 167 patients, with a total of 6 773 633 hospital visits. Of this subset, 1 420 538 patients had only one neoplasm diagnosis, 488 623 had two, and 191 726 had three. The maximum number of neoplasm diagnoses was 17 (two patients). For each patient, we collapsed all neoplasm diagnosis records into a single sequence of diagnoses, along with the number of hospital visits, the number of neoplasm diagnoses, and the follow-up time.
Follow-up time was defined here as the length of time from the diagnosis of the primary cancer to the last diagnosis of any disease.
We then separated patients into groups according to the anatomical site of the primary tumour. The ICD-9 scheme codes neoplasms based on anatomical location rather than histology or other pathological characteristics, and thus our grouping is effectively by ICD-9 number. Supplementary Table S1 shows the three-digit ICD-9 codes corresponding to the 43 selected primary cancer types. Certain groups are less specific than others and include more biologically dissimilar tumour types. Certain groups may also contain many more patients than others, reflecting the nonuniform incidence of cancer based on tissue type and anatomical site. Nonetheless, this grouping allows for a reasonably high-resolution categorisation of anatomical sites. For metastasis diagnoses, we used four-and fivedigit codes within 196, 197, and 198, which are similarly classified according to anatomical location. Supplementary Table S2 lists the 27 metastatic sites selected, which include lymph nodes, as well as distant tissues and organs.

Construction of cancer metastasis networks
Patients were censored by overall follow-up time. In other words, at every point of time, only patients with a longer overall follow-up time are considered. This ensures that the analysed patients are still in the system at a particular point of time, and that we can be confident that they have not died. The nodes of a cancer metastasis network represent the distant sites where metastases may arise for a given primary tumour type. The size of each node represents its conditional incidence or hazard. We defined the incidence hazard function as where m met (t) is the number of diagnoses of metastasis met at time t, and N X (t) is the number of patients remaining at time t (where all the patients with an overall follow-up time less than t are censored) for primary tumour type X. We used discrete times of 1 month, so therefore t ¼ t i -t i -1 ,i ¼ 0y48. The cumulative hazard for an X and met pair is then simply: To quantify the dynamics of metastasis development, we looked at the incidence of metastases in terms of co-occurrence at every point of time. This allows us to establish links between the primary tumour and metastasis sites, as well as between different metastasis sites for multiple cases.

Co-occurrence measures
We quantified co-occurrence using two measures, the j-correlation (Pearson's correlation between dichotomous variables) and relative risk (RR). The j-correlation is defined as: where C ij (t) is the number of co-occurrences at time t. i and j represent particular sites of metastasis or the primary tumour itself (in other words, one may discover links either between the primary tumour and specific sites of metastasis, or between two different sites of metastasis). X represents the primary tumour type. t ¼ t i -t i -1 ,i ¼ 0y48. RR is defined as: When i and j are observed together more than random chance would dictate, RR 41 and j 40. Although relative risk is used quite commonly in the medical literature, it has certain drawbacks when used in this context. RR tends to be biased toward higher values when looking at metastases of low incidence, whereas it is biased toward lower values when looking at those of high incidence. The j-correlation, on the other hand, is biased toward zero when analysing the link between metastases of differing incidence or prevalence. However, j tends to be the better measure for analysing links across multiple cancer metastasis networks, as its scale would fluctuate much less than that of RR. This is because the values are better normalised to their respective population sizes, even though the underlying patient population of two primary cancer networks may be quite different.

Models for predicting the primary cancer site
On the basis of the metastatic patterns, it may be possible to predict the site of a primary, occult cancer. Multinomial logistic regression: We first used multinomial logistic regression (MLR) to build an algorithm for predicting the site of a patient's primary cancer, given the vector of their sites of metastases. The data were split into half, with patients randomly assigned to either a training set or a test set. For MLR, we used information only on whether a metastasis at a particular site was detected, disregarding the time at which it was detected. We derived the coefficient estimates with a hierarchical model using the training set, and subsequently applied this model on the test set to assess its accuracy. Usage of cancer metastasis networks (I): We then developed an algorithm using the metastasis networks, incorporating the additional variable of time. Given a sequence of metastases, M ¼ {m 1 (t 1 ),m 2 (t 2 ),y,m n (t n )}, we define the following matrix: where X denotes the primary cancer type. Each primary cancer site has its own metastasis network, and thus this matrix is a summary of the properties of network X at those nodes and links specified by M, the temporal sequence of metastases for a single patient. For each patient, the predicted site of their primary cancer is the site X, which yields the largest value of ||O X ||.

Models for predicting secondary cancer sites (metastases)
Knowing the primary cancer type, it is clinically important to know how likely it is that metastases may arise and where they may occur, as this will change the staging of the disease, which in turn will guide treatment options. Fractional method: In the medical literature, metastasis patterns are often reported as percentages or fractions, without any temporal information. For example, if 30% of patients with breast cancer had or eventually became diagnosed with bone metastases, then for a new breast cancer patient, we will say that this patient has a 30% chance of developing a bone metastasis. For each primary cancer type, we split the patients randomly into either a training set or a test set. Using the training set to derive the fractions of patients developing metastases to each distant site, we then applied those fractions to the test set. We sequentially analysed patients having n mets ¼ 1, 2, 3, and 4 metastases. For each patient with primary cancer type X in the test set, the probability of an accurate prediction, p f , is the fraction of patients with primary cancer type X in the training set developing m n , given m 1 ,m 2 ,y,m n -1 . However, we discard this condition by analysing only the nth metastasis. That is, using n mets ¼ 3 as an example, we assume m 1 and m 2 , so p f is simply the probability of developing m 3 . This allows for more direct comparisons. The overall accuracy is then the mean, p f . Usage of cancer metastasis networks (II): With the fractional method as a baseline for comparison, we developed an algorithm for predicting future sites of metastases using cancer metastasis networks. We may think of these networks as entities on which the metastatic disease of individual patients evolve, and are able to incorporate temporal dynamics, as well as subtle relational properties. We developed cancer metastasis networks for each primary cancer type using the training set. For each patient in the test set, the probability of an accurate prediction for m n , p net , given the primary cancer type X and metastases m 1 ,m 2 ,ym n -1 , is calculated by (see Figure 5A for a graphical summary): where Q are all the links connecting m n to the node for the primary cancer site or the nodes m 1 ,m 2 ,ym n -1 , R are all the links from the nodes m 1 ,m 2 ,ym n -1 or the primary cancer node, S are all the links between any combination of the nodes m 1 ,m 2 ,ym n -1 or the primary cancer node, and t n is the time corresponding to the incidence of metastasis m n . Only j X,ij (t) with P-value o0.05 are considered. We analysed separately patients having n mets ¼ 1, 2, 3, and 4 metastases. The overall accuracy is then the mean, p net . The ratio p net /p f captures the improvement over the fractional method of using these networks for prediction.

Cancer metastasis networks
We constructed cancer metastasis networks for 43 primary sites, as listed in Supplementary Table S1. We then considered 27 possible secondary sites of dissemination for these primary cancers, as listed in Supplementary Table S2. Nodes represent metastasis sites, and thus number 27 in each network. The incidence of different types of cancer as captured by the data set is shown in Supplementary Figure S1. The largest numbers of diagnoses are for prostate, colon, lung, and bladder cancer. For the majority of cancer primary sites, the pattern of metastatic dissemination sites is quite selective, with a few sites having very strong links and many others holding comparatively weaker links.

Cancer metastasis network dynamics
Metastasis conditional incidence (hazard) functions for cancers arising at six primary sites are shown in Figure 1. Each curve represents the hazard function for a particular secondary site. Similarly, with the metastasis network links, we can plot their dynamics over time. Figure 2A is the colon cancer-specific metastasis network at t ¼ 0, and Figure 2B shows the network at t ¼ 48 months. We can extract dynamical information from the evolution of the network links over time. Figures 2C and D show, for the array of all possible pair-wise links, the monthly increase in the link strength. For any given pair, link strength representing the likelihood metastases at one anatomical site will be found simultaneously with metastases at the other site. Only statistically significant links are shown. Figure 2C, which uses the phi measure to characterise link strength, creates a more detailed picture of the overall dynamics. Initially, a few links steadily and solidly increase. As cancers progress, many more links are added, and link addition becomes much more scattered, and thus covers many more link possibilities. As a consequence, at t ¼ 0, the strength distribution of these links is narrow and centred about a relatively low strength value (Supplementary Figure S2A). As the cancers progress, these distributions naturally shift toward higher strength values, and evolve toward a more uniform profile. Using the information on link dynamics for each network, we can then compare the networks and determine how similar they are to one another, across distinct cancer types. This takes into account not only topography but dynamics as well. We measured the pair-wise correlations between metastasis network links for every primary cancer type. The correlation coefficient matrix is shown in Supplementary Figure S3. Although the vast majority of primary cancer types exhibit low correlation values with one another based on this approach, a few do stand out: (i) 'colon' and 'rectum and anus', (ii) 'lung and bronchus' and 'prostate', (iii) 'breast, female' and 'prostate'. Although 'colon' and 'rectum and anus' should be expected to emerge as correlated, being of essentially the same tissue, the other two pairs are less expected. Breast and prostate cancer both metastasise with high affinity to the bone (Yoneda, 1998), and are both slower-progressing cancers (Peer et al, 1993;Barry, 2001), which may explain why the two also emerged as a highly correlated pair in terms of the metastasis network link dynamics. The correlation between lung cancer and prostate metastasis dynamics is more puzzling. However, as this analysis is looking at the links, and not the nodes, more subtle mechanisms are at play, and so perhaps more in-depth experimental research on lung and prostate cancer metastasis dynamics seems warranted.

Topographical clustering
Results of the hierarchical clustering of the sites of primary tumour and the sites of metastasis by their incidence hazard function are shown in Figure 3 (t ¼ 0 in Figure 3A and t ¼ 48 months in Figure 3B). At t ¼ 0, primary cancer types are clustered in three large groups with distinct patterns of metastasis development. The first group, which includes the ovary, pancreas, gallbladder, Although labels for the corresponding curves are not shown, this vector of metastatic sites follows the same ordering among these four graphs, thus revealing distinct spatial and dynamical patterns.
rectum and anus, colon, small intestine, and stomach, very strongly metastasise to the peritoneum, liver, and intra-abdominal lymph nodes. The second group, which includes the hypopharynx, oropharynx, tongue, thyroid, nasopharynx, floor of the mouth, gum, larynx, and lip, metastasise strongly to the lymph nodes in the head, face, and neck, and to a lesser degree, the bone and lung. The third group, which includes cancers, such as lung, prostate, bone, testis, kidney, liver, oesophagus, uterus, cervix, skin (melanoma), and others, include cancers which at t ¼ 0 tend to exhibit metastasis profiles with broader specificity and comparatively lower magnitudes. Breast cancer, however, is clustered by itself because of the strong affinity for the axillary lymph nodes. Through all of this, it must also be kept in mind that different cancers have differing proportions of the stages at which they are presented at diagnosis, because of the varying natural histories and different abilities in screening and detection (Halpern et al, 2008;Jemal et al, 2008). However, this clustergram reveals a distinct pattern arranged strictly by anatomical location. By t ¼ 48, the pattern becomes more perturbed, but much of the anatomical arrangement present earlier is still preserved.

Prediction of the primary cancer site from a sequence of metastases
The multinomial logistic regression model achieved an overall accuracy of 51%, with most patients being classified as one of the six major cancer types. Prostate was correctly classified (true positive rate) 84% of the time, colon 80%, lung and bronchus 69%, ovary 64%, larynx 61%, and female breast 56% (Supplementary Table S3). The other cancer types had a true positive rate of o10%.
Hess et al, 2006 developed a similar MLR algorithm for predicting the primary cancer site (nine sites) given a set of metastases. An overall accuracy of 64% was achieved, which is slightly better than the overall accuracy of our algorithm, but it must be kept in mind The cancer metastasis network for colon cancer (chosen as a representative example) and the dynamics of its links. (A), network at t ¼ 0, or the time of diagnosis of the primary tumour. (B), network at t ¼ 48 months. Nodes correspond to anatomical sites of metastases, the size of which represents their respective incidence rates. The widths of the links represent the strength of metastasis co-occurrence for two anatomical sites. Yellow nodes represent lymph node metastases; red nodes represent organ metastases. The curves in Figure 1 represent the monthly growth of these nodes, whereas the following (phi) represents the monthly growth of the links: (C), metastasis site co-occurrence associations as measured by phi over time. All the possible associations are lined up on the x axis, and their temporal dynamics are represented by the y axis. Only phi with P-value o0.01 are shown. (D), metastasis site cooccurrence associations as measured by relative risk over time. Only RR values with 99% confidence interval or RR o0.1 are shown.
that we used 43 primary cancer sites (which includes many less common sites). Rather than classifying patients into one of the six major cancer types, the network model for predicting the primary cancer site classifies patients into many more categories. Eleven cancer types achieved a true positive rate of 425%, most of which are less common cancers (Supplementary Table S4 Figures S4 and S5).
into other categories, ovary had a true positive rate of 81%, hypopharynx 75%, male breast 70%, pleura 46%, pancreas 40%, small intestine 39%, female breast 37%, male genital 33%, lung and bronchus 28%, cervix 27%, and female genital 26%. Even though the overall accuracy may be less than that of the MLR algorithm, the network model has the advantage of broader specificity and sensitivity toward cancers of less common sites (Supplementary Table S5). The true positive rates for those sites exhibiting true positive rates 425% with either method are shown in Figure 4.

Prediction of additional secondary cancer sites (metastases)
Although the previous method should prove helpful in the case of an occult primary neoplasm that -other than the symptomatic metastases -does not yet show on imaging, perhaps the more clinically useful prediction is the forward prediction of additional possible metastases. We therefore compared a cancer metastasis network-based algorithm and a traditional fractional method on patients with 1, 2, 3, and 4 metastases. Those results are summarised in Table 1 and Supplementary Table S6. Figure 5A is a graphical summary of the algorithm methodology. For patients with 1 metastasis, predicting m 1 turns out to be no better than using the fractional method. This is expected, as the strength of those links directly connected to the primary cancer node is proportional to their respective metastasis incidences. However, with n mets 41, the algorithm with the network model performs better than the fractional method for the majority of primary cancer sites ( Figures 5B -D

Discussion
Through a large data set of cancer patients, we have investigated the topographical patterns of clinical metastasis development using a network approach. Although the 'seed and soil' hypothesis (Fidler, 2003) certainly still holds, both anatomical proximity and anatomical connection seem to be dominant factors when the analysis of metastatic sites includes many more sites and many more primary cancer types. To our knowledge, such a comprehensive study has not previously been conducted, especially not one including the rarer cancer types.
Our study has shown that treating secondary metastases as separate, comorbid diseases allows the construction of cancer type-specific metastatic progression networks. From these networks, we are able to analyse the dynamics of each cancer-specific network and compare one network to another. Furthermore, we are able to use these networks as the basis of predictive algorithms, which we have shown in many cases to be better performing than conventional algorithms. We note that there are also other types of models one can build for comparison, such as a Markov model or a Cox model with time-dependent covariates. However, these may be better suited to smaller-scale studies with more detailed information on the underlying variables.
In Figure 1, we showed that the profile of hazard functions for certain types of cancer can be highly specific, such as in prostate cancer, or it can have a much broader profile, such as in bladder cancer. Broader profiles create for three possibilities. The first possibility is that these cancer types truly do have lesser selectivity in the sites of secondary dissemination. The second possibility is that the cancer type categories as defined here encompass a broad range of sub-classifications, each of which may exhibit distinct patterns by themselves. A third possibility is that these cancer types have tumours, which display more cellular heterogeneity, with different clonal populations within the tumours possessing different affinities to distant sites (Fidler, 1978(Fidler, , 2003. In recent years, we have come to discover a number of molecules that drive organ specificity, but it still does not necessarily answer the question of why different types and subtypes of cancer metastasise to specific secondary sites, and with varied propensities. Consideration of the original predisposition of the transformed cell of origin suggests several possibilities that may explain these phenomena. Certain cell lineages may express molecules that bias the metastatic efficiency to various target organs. For example, both normal and cancerous mammary epithelial cells express Receptor Activator of Nuclear Factor kB (RANK) -the receptor for the osteoclast differentiation factor Receptor Activator of Nuclear Factor kB ligand (RANKL) (Dougall and Chaisson, 2006). Studies suggest that this receptorligand combination may predispose breast cancer cells to colonise bone (Roodman, 2004;Yoneda and Hiraga, 2005;Jones et al, 2006). The developmental history of a cell may also predispose it to activate expression of specific metastasis-promoting mechanisms on malignant transformation. Lineage-specific signalling circuits may create differential responses to the same oncogenic alterations, or developmentally imprinted epigenetic modifications may influence transcriptional accessibility of the transformed genome. Therefore, one would expect cells that are developmentally similar to act in a more similar manner than cells further apart in lineage. Indeed, in Figure 3B, we see that at t ¼ 48 months, a time when most cancers have progressed to an advanced stage, the clustering suggests that this developmental effect may have a role, in addition to blood flow characteristics. The clustergram groups primary sites into more or less three groups. The first group is composed of abdominal sites connected through the hepatic portal system. The second group is comprised of various sites in the torso, and the third group is comprised of various sites in the head and neck. Anatomical arguments seem to still dominate.  Figure 4 Prediction of the primary cancer site from a sequence of metastases. The primary cancer types for which the true positive rates exceed 25% from each model are shown. The multinomial logistic regression (MLR) algorithm takes into account the number of patients in the respective categories, and therefore, a relatively rare cancer type will be classified as a common cancer type with similar metastasis patterns. The MLR algorithm and the network algorithm perform in different ways: the MLR classifies everything into a few common cancer types, whereas the network algorithm is able to differentiate between rarer cancer types.
Although no similar large-scale study has been carried out comparing the metastasis development patterns by anatomical site of a large numbers of different cancer types, the data set we use does carry with it some limitations. The Medicare claims data contain information on 96% of Americans X65 years of age (Hatten, 1980). This provides excellent coverage of an entire demographic, yet does not represent a cross-section of the population-at-large in terms of age. The data set also does not contain information on patients who were not hospitalised. Within the data, diseases were recorded in the ICD-9-CM classification scheme format and potential errors for using the ICD classification scheme at entry, and disease coding in general have been noted (Surján, 1999). However, Medicare claims data have been shown to be both accurate (Zhang et al, 1999;Hennessy et al, 2007) and sensitive (Cooper et al, 1999). Finally, the record of an event of metastasis is a function of its clinical detection. Metastases are not typically one or two new growths -they may number in hundreds or more, many of which are below thresholds of clinical detection. Micrometastatic disease is an important prognostic indicator (Pantel et al, 1999), but they will escape detection and thus not be recorded. This restricts both the spatial and temporal resolutions.
We have nonetheless been able to show that the cancer metastasis network captures important and useful temporal and relational information, and thus be able to serve as the basis for better predictive algorithms. Using a network approach, additional questions on metastatic dissemination can be explored -for example, how network properties and characteristics change by age, gender, race, disease stage, or treatment. In attempts to explore these and other questions, the Surveillance, Epidemiology and End Results (SEER)-Medicare linked data set may be a useful resource (http://seer.cancer.gov). The study of more specific networks may yield further insights into the metastatic cascade and the patterns of metastasis per se. In addition, coupling molecular information underlying cancer with these phenotypic networks may also prove useful, and possibly lead to better treatment of Table 1 Accuracy of the network model in predicting the temporal sequence of metastases, by site of primary cancer, for n mets ¼ 2 and n mets ¼ 3 p f is the mean accuracy for the fractional method, p net is the mean accuracy for the network model, and p net /p f is their ratio. N is the number of patients in each case.
Cancer metastasis networks and progression patterns LL Chen et al metastatic cancer. At the very least, these metastasis networks may be used to identify a likely sequence of metastases in a patient, and thus guide diagnostic tests and specific treatment targeting those sites.  Figure 5 Using the cancer metastasis networks to predict the temporal sequence of additional metastases. (A), diagram of how p net is calculated, for the case of n mets ¼ 2. To calculate the probability of developing a metastasis at site m2, given the primary cancer type represented in blue and a metastasis already having developed at site m1, the strength of the green links (represented by their widths) are summed, and divided by the summation of the strength of the red and green links. The grey links are ignored. (B), p f vs p net , for n mets ¼ 2. (C), p f vs p net , for n mets ¼ 3. (D), p f vs p net , for n mets ¼ 4. Each point represents a primary cancer type. Red represents the primary cancer types for which p net op f , and blue represents those for which p net 4p f .