A Simple 3-Parameter Model for Cancer Incidences

We propose a simple 3-parameter model that provides very good fits for incidence curves of 18 common solid cancers even when variations due to different locations, races, or periods are taken into account. From a data perspective, we use model selection (Akaike information criterion) to show that this model, which is based on the Weibull distribution, outperforms other simple models like the Gamma distribution. From a modeling perspective, the Weibull distribution can be justified as modeling the accumulation of driver events, which establishes a link to stem cell division based cancer development models and a connection to a recursion formula for intrinsic cancer risk published by Wu et al. For the recursion formula a closed form solution is given, which will help to simplify future analyses. Additionally, we perform a sensitivity analysis for the parameters, showing that two of the three parameters can vary over several orders of magnitude. However, the shape parameter of the Weibull distribution, which corresponds to the number of driver mutations required for cancer onset, can be robustly estimated from epidemiological data.

Cancers arise after accumulating epigenetic and genetic aberrations [1][2][3] . Earlier studies established a power law model on the basis of multi-stage somatic mutation theory to explain age-dependent incidences 4-6 for several cancer types. As noted by Hornsby et al. 7 in the context of classical epidemiological studies most cancers occur with the same characteristic pattern of incidence, and the simplicity of this pattern is in contrast to the perceived complexity of carcinogenesis. Orthogonal to these age stratification of different cancer types, Tomasetti and Vogelstein 8 (with follow-ups 9,10 ) reported a significant association between life time caner risk and stem cell divisions and concluded the latter substantially contributes to the former. Challenging the conclusion of Tomasetti and Vogelstein 8 of a high-intrinsic cancer risk Wu et al. 11 subdivided cancer risk into extrinsic and intrinsic risk, arguing extrinsic factors contribute more to cancer risks than intrinsic factors do. Based on a mechanistic model of accumulated mutations, these authors provided a recursion formula for theoretical life time intrinsic risk (tLIR) parameterized by age a. This recursion formula has the closed form solution = − − − − + ⋅ a r tLIR( ) 1 (1 (1 (1 ) )) S d a k S log 2 , where S can be interpreted as the numbers of stem cell, d as the stem cell division rate, k as number of driver events required for cancer onset and r as the mutation rate per division. They reported that tLIR goes outside of the plausible range of empirical cancer risks by studying several pairs of values for two parameters (mutation rate and driver gene mutations) concluding that there is a substantial contribution of extrinsic risk factors to cancer development. However, this conclusion only holds in the studied parameter space and when parameters for all cancer types are treated uniformly. By performing a systematic grid search in the space of biologically plausible parameter values we showed that tLIR can be close to empirical risk for different cancer types (R 2 > 0.85). If the extrinsic risk factor is computed by simply setting it to a complement of 1 for the intrinsic risk factor as performed by Wu et al. it will be concluded that there is a possibility of high intrinsic risk, so that one of the presented arguments by Wu et al. 11 is fallacious.
On a pure mathematical side, we show that a scaled Weibull function with 3 parameters approximates the 4-parameter mechanistic tLIR model. On an epidemiological data analytical side, this simple 3-parameter model excellently agrees with age-dependent cancer incidence curves among 18 common solid cancers even when variations due to different locations, races, or periods are taken into account. With this model, we study the

Approximation of tLIR model by a scaled Weibull function. As is derived in the Materials and
Methods the 4-parameter mechanistic tLIR model can be approximated by a scaled Weibull function with 3 parameters: assuming that λ is defined by λ = .
− S rd P ( ) (2) k k is the cumulative distribution function of the Weibull distribution, and P is the number of independent parallel processes, which e.g. can be interpreted as cell population at risk 12 . Whether the total tissue cells or only a fraction of stem cells are susceptible for cancer risk is unclear 13,14 . If one sets P = S then rd = λ −1 . However, other possible choices for P allow to account for other factors such as the selection of mutations 15,16 , the stem cell microenvironment 17,18 , and tissue architecture [19][20][21][22] , or effects of clonal expansion [23][24][25] . Models incorporating clonal expansion have additional parameters such as the number of clonal copies. Reducing the dimensions of such complexed models results in tLIR, in which S is interpreted as number of independent clusters after clonal expansion rather than the number of stem cells, r and d denote "net" mutation and division rate of independent clusters at average level rather than those of single cells. Whereas a precise analysis of models for clonal expansion will be the topic of future work, these considerations show that when using the scaled Weibull distribution, prior knowledge on the parameter ranges is not necessary. This is indeed one benefit of scaled Weibull function comparing to tLIR model which requires a biologically reasonable guessing on stem cell numbers, mutation rate, cell division rate and number of driver mutations. The Weibull distribution is a special case of the generalized extreme value distribution (GEV) 26 . The GEV distribution plays the same role within extreme value statistics as the normal distribution does in average value statistics. It results in the limit distribution being maximized over many independent and identically distributed random variables, thus becoming the default model for the accumulation of micro events which finally leads to a macro event. The GEV is the limit distribution when one takes the maximum (and not the sum) of many independent and identically distributed random variables, thus being the default model for the accumulation of micro events which finally lead to a macro event. Accordingly, the Weibull distribution is not just a distribution providing a good empirical fit, but can be seen as justifiable for use in a plausible causative model of cancer genesis.
Fitting empirical incidence rates with scaled Weibull function. We performed extensive simulations and parameter fittings for the empirical incidence cuminc c (a) of cancer type c at age a using the scaled Weibull function: cuminc c (a) ≈ P c ⋅ Weibull(λ c , k c )(a). The model agrees excellently with age-dependent age incidences of 18 common solid cancers (R 2 > 0.99, Fig. 1).
Goodness of fit maintains when parameters P c and λ c , varying roughly two orders of magnitudes (Fig. 2). This finding suggests that many parameter combinations provide similar dynamics that are consistent with empirical data. So any interpretations of P c and λ c have to take into account this considerable uncertainty. Nevertheless, the estimates for P c are several orders of magnitude smaller than the realistic number of stem cells provided by Tomasetti and Vogelstein 8 , yielding evidence supporting the above statement that the number of independent local processes is not equal to the number of stem cells. The estimates of parameter k c , which corresponds to the number of driver events in the mechanistic model, are robust against variations of parameters P c and λ c (Fig. 2). Moreover, the estimates of k c are robust against race, sex, period and location (Fig. 3). In Supplemental Fig. 2 the best fits of shape parameters are plotted against the best fits of scale parameters for 694 time series.
Relationship between cancer incidence and stem cell divisions. Tomasetti and Vogelstein 8 suggested that the variation in cancer risk among tissues can be explained by the number of stem cell divisions. They reported that the tissue-specific cancer risk is strongly correlated (0.81) with life-time stem cell divisions (LSCD). These authors stated that the total number of stem cell divisions is a causative factor of cancer risk. This assumption yields a prediction on age structured data: for tissue type c the number of stem cell divisions up to age a, which we will denote by LCSD c (a), should then be strongly correlated with cuminc c (a). However, using age incidence data obtained from the SEER-database 27 we found that the regression lines for most tissue types c for age data of 40, 50, 60, 70, and 80 years of cuminc c (a) plotted against LCSD c (a) in a log-log-scale are much steeper than the ones of the regression lines for different c and cuminc c (80)-using 80 as average life span as was done by Tomasetti and Vogelstein 8 (see Fig. 4).  Overall, age-dependent stem cell divisions (using ages 40-80 years) is modestly correlated to age-dependent cancer risk for the 31 cancer types considered by Tomasetti and Vogelstein 8 using the SEER database and the estimates of stem cell divisions given therein (Pearson correlation coefficient ρ = 0.51).
Hence, the strong correlation for (life-time) tissue-specific cancer risk with life-time stem cell divisions (LSCD) cannot be explained by the simple causative factor (involving the product of the number of stem cells and the number of divisions of each stem cell) suggested by Tomasetti and Vogelstein 8 . A causal explanation on cancer risk should at least shows that the association between cancer risk and risk factor observed at overall level is reproducible on age stratified data. However, one caveat to such explanation, co-factors of risk factors might not be appreciated.
In our 3-parameter model, which gives good fits for age dependent cancer risks, several relations between the model parameters and cancer risks at a certain age can be observed. For instance in our parameter estimates good fits are possible when taking the inverse of the lifetime cancer risk P c ≈ 1/cuminc c (85). However, we will not suggest that the number P c of cell population at risk is an explanation for the variation of cancer risks among tissues: as the range of P c yielding good fits varies by two orders of magnitude and independently determining this number is difficult to achieve, a corresponding hypothesis is difficult to verify or to falsify.
As the sensitivity analysis for the scale parameter λ c (Fig. 2) shows that this parameter varies over several order of magnitudes, still yielding very good fits (R 2 > 0.99), the corresponding estimates for the mutation rate r in the tLIR model using the approximation (1) and relation (2) are also very uncertain, even when fixing values of S and d and leaving out the considerable uncertainty of these. Nevertheless, when using estimates of S and d taken from the literature 8 the obtained ranges of values of r using relation (2) for several cancer types do not intersect the range [10 −10 , 10 −6 ] of "plausible values" of r suggested by Wu et al. 11 . If we extend the analysis to allow "good fits" by setting a threshold R 2 > 0.85, then good fit of the tLIR model with r ∈ [10 −10 , 10 −6 ] are possible to achieve (Table 1).
Testing performance of our 3-parameter against other simpler model. For testing the performance of our 3-parameter model against other simpler models, we compared the fitting performance of the scaled Weibull function to that of 2-parameter power law model arising as the simplest instance from multistage theory 5,7,12,13 . The empirical time series for different locations, periods and races were fitted (694 time series all together) using both models, the power law model had a goodness of fit of R 2 < 0.90 for 90 time series (13.0%), R 2 < 0.95 for 257 time series (37.0%), and R 2 < 0.98 for 366 time series (52.7%). In contrast, our 3-parameter scaled Weibull model resulted in R 2 > 0.9 for all time series, R 2 > 0.98 for 686 (=98.8%) of the time series, and R 2 > 0.99 for 679 (=97.8%) of them ( Fig. 6(a)).
We compare the fitting performance of the scaled Weibull function against that of the scaled Gamma function. Although both functions fit data equivalently well in most cases, the scaled Weibull function outperforms the scaled Gamma function in several time series. (Fig. 6(b)) displays R 2 reporting goodness of fit for the two functions. We also calculate the Akaike information ciriterion(AIC) 28 , a likelihood based measurement. A lower AIC value indicates a better fit. Table 2 reports the AIC for 18 cancer types, the AIC values for Weibull function are lower than those of Gamma function in 15 cancer types. Estimating the Number of Driver Mutations for Cancer Onset. In our model the shape parameter k c reflects the number of mutations required for cancer onset. The values of this parameter are, however, higher than the number of mutations estimated from sequencing data by Vogelstein et al. 29 . Vogelstein et al. suggested technical issues as an explanation for the inconsistency between estimates from epidemiological data and sequencing data. Notably, our k c estimates and the number of driver mutations estimated from a classical power law model are roughly in the same numerical range (Fig. 7). Since we obtain better and more robust fits than the power law model, we believe that our estimated driver mutation numbers are more trustworthy.

Discussion
In this study we connected the mechanism-based cancer development tLIR model to the Weibull distribution function. We tested its validity by fitting a 3-parameter Weibull function to data from 18 common solid cancer types, consisting of more than 600 time series. The scaled Weibull function fits well with age dependent incidence curves of all studied cancers and outperforms other models, such as the commonly used 2-parameter power law model and a 3-parameter scaled Gamma function model. With the scaled Weibull function, we can estimate the number of driver mutations required for cancer onset in individual cancer types. To our knowledge, this is the first work matching pan-cancer incidence curves with a statistical distribution function that is partially biologically informative.
Compared to the tLIR model developed by Wu et al. 11 we see two technical benefits of our suggested approach: First, the scaled Weibull function involves less parameters than tLIR, but it remains to be biologically interpretable. The tLIR model includes several details of the multi-staged process of cancer development, e.g. the number of steps required for transforming a normal cell to malignancy, the number of stem cells in a tissue and division rate of stem cells. Although the tLIR model indeed provides useful insights into linking age-dependent somatic mutations to cancer risk, it has also limitations. For example, it ignores the effects of clonal expansion 25 . Another issue is that most parameters in the tLIR model are difficult to measure accurately in practice. Following Tomasetti and Vogelstein 8 the number of stem cell divisions can be estimated, but the accuracy has been criticized 30 . In contrast, our suggested model requires less specific assumptions about the parameters to be measured in practice. generations. *Cancers of which parameter estimates are impossible because division rate is 0. Moreover, the Weibull distribution is a special case of the generalized extreme value distribution (GEV) which is well connected to classical statistical approaches to describe rare events 26 .
Our analysis results mostly agree with those provided by Wu et al. 11 . They defined intrinsic cancer risk as the probability that one tissue transforms from normal to tumor because of accumulated mutations, and extrinsic cancer risk as 1-intrinsic cancer risk. They quantified upper bounds to intrinsic cancer risk by tLIR but did not properly fit tLIR to epidemiological data, concluding intrinsic factor insignificantly contributes to cancer. According to our understanding their argument mainly results from insufficient exploration of parameter space and implicitly assumes that all tissues require the same number of driver mutation to initiate cancer. Our results suggest that the contribution of extrinsic factors to cancer is overestimated by Wu et al. 11 . However, one should note that the excellent agreement between the scaled Weibull distribution function and empirical data does not necessarily exclude that in addition to intrinsic there are further extrinsic and unknown risk factors. In that context it is worthwhile to mention that our estimated number of driver mutations required for cancer onset differs from tissue to tissue. Although the exact number is not validated by biological experiment, this observation is consistent with findings in genetic studies 29 .
One interesting observation is that all non-reproductive tissues have a similar cancer risk accumulation pattern. Cancer incidence rates increase dramatically at about 40-50 years, peaking at about 55-70 years and then decrease. This pattern matches findings reported by Podolskiy et al. 31 . A question for future work is whether mutation load agrees with the scaled Weibull function or age-specific mutational signatures [32][33][34] . Another interesting observation is that testicular germ cell cancer incidence peaks at younger age compared to other cancer types, which might be explained by accelerated aging of testis 31 . Altogether we believe that our suggested approach provides insights into cancer development by providing a link between empirical data and a mechanism-based model.

Methods
Fitting cumulative cancer incidence with a model for theoretical intrinsic cancer risk (tLIR). Wu et al. 11 provided the following recursion formula to compute the chance that a single stem cell acquires k mutation hits after g divisions given a mutation rate r. given the initial cell state at generation 0: Here X g is accumulated driver mutations at generation g, i and j represents accumulated driver mutations at generation g and g + 1, respectively. A fully developed tissue with S stem cells must go through = + ⋅ n S d a log 2 rounds of divisions, assuming division rate is d and age a. With this transition probability (3), the theoretical lifetime intrinsic cancer risk (tLIR) is formulated as Although the recursion formula being dependent on more than one parameter cannot directly be solved in closed form by standard algorithmic techniques, it has nevertheless a simple closed form solution, which was derived by hand computations and verified by standard symbolic computations (using the computer algebra system Maple 2015.2): The formula for the age-parameterized theoretical lifetime intrinsic cancer risk (tLIR) hence has the following simple closed form solution, which allows much faster and hence more extensive computations and extends the range of admissible values of k from the positive integers to the positive real numbers: Notice that our result basically coincides with the one obtained by Calabrese and Shibata 35 that was obtained by a direct probabilistic reasoning.

Relating the tLIR model to a scaled Weibull function. We found a connection between
S d a k S log 2 and the scaled Weibull function P where P is the cell population at risk. To see this connection, we assume that  r 1. Then S d a k k 0 log 2 2 and using the Taylor series for log and exp, we obtain S 0 Comparing (7) and (9) we observe that these relations coincide if Using that relation and (8)  we obtain an unshifted one. This condition admits a transparent interpretation, namely, the number of stem cell divisions (for a fixed cell) should be more than the logarithm of stem cell number. Then we have that the tLIR incidence approximately equals to the scaled Weibull incidence if the parameters satisfy   period, location, histology and ICD (international classification of disease) code. These data were stored in ASCII file, we used the SEERaBomb R package to parse them into sqlite file facilitating data manipulation. Cancer names provided by Tomasetti and Vogelstein 8 can not be directly mapped into those in SEER database. We addressed this difficulty by two steps: first, annotate tumor primary site to (international classification of  disease-oncology 3) ICD-O-3 code based on the literal sense of site in Tomasetti and Vogelstein 8 ; second, annotate histology to ICD-O-3 code based on the literal sense of cancer histology by Tomasetti and Vogelstein 8 . For  instance, primary site of lung adenocarcinoma is lung, corresponding to ICD-O-3 site code: C340, C341, C342,  C343, C348, C349; adenocarcinoma of lung cancer corresponds to ICD-O-3 histology code 8140, 8141, 8143,  8147, 8570, 8571, 8572, 8573, 8574, 8575, 8576. The dictionary needed for mapping step (we call it ICD dictionary) can be found in http://seer.cancer.gov/icd-o-3/. Osteosarcoma definition can be found in ICD dictionary, it is a subtype of malignant bone neoplasm, corresponding ICD-O-3 histology code: 9180-9189. However, the ICD dictionary does not differentiate between osteosarcoma detected in the head, leg, or arm. The ICD9Data database (http://www.icd9data.com/) defines bone cancer using ICD9 code 1700-1709, bone cancer in head, arms, legs, pelvis using ICD9 code 1700, 1704-1705, 1707-1708, 1706 respectively. Head and neck squamous cell carcinoma involves tumors located in many sites, ICD dictionary fails to provide its definition. Liao et al. 36 provided ICD9 site code: 1400-1419, 1430-1499, 1600-1619, we then used ICD-O-3 histology code: 8070-8076, 8078 to select squamous cell carcinoma. More detailed cancer definitions using ICD code can be found in Table 3. Two hematopoietic cancers: acute myeloid leukemia and chronic lymphocytic leukemia, are defined using site recode ICD-O-3/WHO 2008 definition (http://seer.cancer.gov/siterecode/icdo3_dwhoheme/index.html).
Although we carefully annotated 25 cancer definitions using ICD code, we can not avoid misclassifications. because annotation needs several data sources of which information confidential levels differ from each other. The Cancer Genome Atlas (TCGA) program 37 is a flag project of cancer research hosted by National Institutes of Health, it provides comprehensive, high-quality molecular and clinical data. Cancer definitions are well annotated using ICD code in TCGA clinic documents. We therefore assume TCGA cancer definitions are precise and extracted definitions of 18 solid tumors (Table 4). With 18 cancer definitions, we selected patients who were diagnosed with cancer after 2000 from SEER-9 registries, SEER-4 registries, SEER-5 registries data to form SEER-18 registries data. As the highest time resolution of SEER data is 1 year, for each year, we took middle age for fitting models, for example, 0 year-old is modified as 0.5 years-old.
For robustness analysis of parameter estimates we classified each cancer into subgroups based on location, period and race, data of subgroups were separately fitted to the mathematical models.
Fitting the models to empirical cancer incidence data. As was done in previous work 38  as the metrics for goodness of fit, where x i and y i is empirical and model-derived cancer incidence respectively, x and y respectively denotes mean value of x and y. Results of fits are given in Table 1 showing that there are biologically reasonable parameter combinations that can yield good fits of the tLIR model for most cancer types.
Data availability. All data used in this study are publicly available. The sources are detailed in the section on methods.