Distinct mutation accumulation rates among tissues determine the variation in cancer risk

Cancer is believed to result from accumulated mutations. However, this concept has not been fully confirmed because the ancestral somatic cell cannot be tracked down. We sought to verify the concept by exploring the correlation between cancer risk and mutation accumulation among different tissues. We hypothesized that the mutations detected through bulk tumor sequencing are shared by the majority, if not all, of tumor cells and therefore largely reflect the mutations accumulated in the ancestral cell that gives rise to the tumor. We collected a comprehensive list of mutation frequencies revealed by bulk tumor sequencing and investigated its correlation with cancer risk as a proxy for the correlation between mutation accumulation and cancer risk. This revealed an approximately 1:1 relationship between mutation frequency and cancer risk in 41 different cancer types based on the sequencing data of 5,542 patients. The correlation strongly suggests that variation in cancer risk among tissues is mainly attributable to distinct mutation accumulation rates. Moreover, the correlation establishes a baseline for evaluating the effect of non-mutagenic carcinogens on cancer risk. Finally, our mathematical modeling provides a reasonable explanation that reinforces the view that cancer risk is predominantly determined by the first rate-limiting mutation.

We hypothesized that these mutations largely reflect the mutations accumulated in the ancestral cell that gives rise to the tumor. In other words, the majority of these mutations should have accumulated before neoplastic transition, and therefore should be correlated with the risk of cancer. This assumption is supported by theoretical reasoning and by recent studies evaluating the accumulation of pre-cancer mutations.
First, cancers caused by environmental factors support the idea that most mutations occur before neoplastic development. For example, lung cancers of smokers contain 10 times more mutations than those of nonsmokers 1 . Even lung cancers of smokers who quit smoking more than 15 years earlier contain ~3 times more mutations than those of nonsmokers (based on the TCGA lung adenocarcinoma dataset) 2 , indicating that these smoking-induced mutations can be preserved until tumorigenic transformation.
Second, cancer risk was found to be associated with stem cell divisions 3 , suggesting that normal cells accumulate, during stem cell division, many somatic mutations that are sufficient for cancer development. This has been confirmed by single-cell studies. For instance, single-cell exome sequencing of normal kidney cells has revealed a remarkable number of somatic mutations 4 , with a mutation frequency similar to that previously revealed in bulk kidney tumors (~1.1 mutations per Mb) 5 .
Third, the fact that mutation frequency is significantly correlated with age supports the assumption that most mutations in bulk tumors occur before neoplastic development 6 . For example, thyroid cancers of patients aged ~80 contain 3 times more mutations than those of patients aged ~20 7 . If the interval from tumorigenic transformation to formation of a detectable tumor does not differ between old and young thyroid cancer patients, then the additional mutations in old patients must have occurred before tumorigenic transformation. Indeed, modeling analysis of the correlation between mutation frequency and age indicates that more than half of the somatic mutations identified in bulk tumors occur during the pre-cancer phase 6,8 .
Finally, after tumorigenesis, new mutations in the bulk tumor accumulate with each clonal expansion. Assuming that all solid cancers have experienced on average the same number of clonal expansions before the tumor reaches a detectable size, and assuming that the error rate of DNA replication during each cell division is the same for different cancers, the number of mutations accumulated after tumorigenic transformation should be similar for different cancers. Since cancers such as rhabdoid cancer, Ewing sarcoma and medulloblastoma contain only 3 to 10 coding region mutations in the bulk tumor 9-11 , we can conclude that post-tumorigenesis clonal expansion accumulates at most 3 to 10 mutations in the bulk tumor, even assuming that all of these mutations occurred after tumorigenic transformation. Therefore, for most cancers, which typically contain 60 to 2000 mutations in the bulk tumor, most mutations are unlikely to have accumulated during clonal expansion. Interestingly, this is supported by an exhaustive mutation analysis of three hepatocellular carcinoma nodules from the same patient, which revealed that ~95% of mutations are common across the different tumors 12 , suggesting that in this case the new tumor nodules resulting from different subclonal expansions have accumulated few new mutations (<5%; fewer than ten).
Altogether, these arguments suggest that the "consensus" mutations revealed by bulk tumor sequencing are indeed a reflection of the mutations accumulated in the ancestral cell. It is important to point out that all our conclusions derived from the correlation between the frequency of "consensus" mutations and cancer incidence remain valid as long as the frequency of "consensus" mutations is proportional to, if not approximately equal to, the mutation frequency in the ancestral cell (see the Mathematical modeling section).

Data collection and processing
All the mutation frequencies are based on results of whole genome sequencing (WGS) or whole exome sequencing (WES). The average mutation frequencies of most cancers were collected directly from the literature, or in several cases calculated using data from the literature. Other cancers were not included, largely because data were lacking or too few samples of that cancer had been analyzed by WGS/WES. When available, lifetime cancer incidences were obtained from the Surveillance, Epidemiology and End Results (SEER) database (www.seer.cancer.gov) 13 and generated by its software DevCan 14 , or obtained directly from the previous study 3 . If the data were not available this way, we used epidemiological statistics to estimate the lifetime incidence for a specific cancer. Details of data collection and processing for each cancer subtype are provided below in separate sections.

Acute myeloid leukemia (AML)
The TCGA group analyzed 200 adult cases of de novo AML, using whole-genome sequencing (WGS) on 50 cases and whole-exome sequencing (WES) on 150 cases 15 . The mutation frequency detected by WGS was not significantly different from that detected by WES, and mutations were found to be randomly distributed throughout the genome, without significant difference between coding and noncoding regions 15 . On average, 13 exonic mutations per AML sample were observed, corresponding to a mutation frequency of ~0.43 per Mb.

Adenoid cystic carcinoma (ACC)
A recent study sequenced the exome of 55 ACC samples and the genome of 5 ACC samples with matched normal DNA, and revealed approximately 0.31 mutations per Mb 16 .
Between 2007 and 2011, the number of new cases of all cancers was 460.4 per 100,000 annually, and the lifetime incidence of cancer is ~40.4% 13 . Given that the incidence rate of ACC is 0.4 per 100,000 people per year (1200 new diagnoses are made in the US per year) 13,17 , the lifetime incidence of ACC is 0.4/460.4·40.4% ≈ 0.035%.
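The scaling used above, and repeated below for other rare cancers, can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and its default arguments are our own labels, and the defaults are the all-cancer figures quoted in the text (460.4 new cases per 100,000 annually and a lifetime incidence of ~40.4%).

```python
def lifetime_incidence_pct(annual_per_100k,
                           all_annual_per_100k=460.4,
                           all_lifetime_pct=40.4):
    """Estimate lifetime incidence (in %) of one cancer type.

    Assumption: the cancer's share of lifetime risk equals its share
    of annual new cases among all cancers.
    """
    return annual_per_100k / all_annual_per_100k * all_lifetime_pct


# ACC: annual incidence of 0.4 per 100,000
print(round(lifetime_incidence_pct(0.4), 3))    # 0.035
# ADC: midpoint of the reported range 0.07-0.2 per 100,000
print(round(lifetime_incidence_pct(0.135), 3))  # 0.012
```

The same call with 1.67 and 0.07 per 100,000 gives the estimates quoted for CCA and RHC, respectively.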

Adrenocortical carcinoma (ADC)
A recent study sequenced 45 ADCs via WES, and revealed a mutation frequency of ~0.6 per Mb 18 .
ADCs are rare, with an annual incidence of 0.07-0.2 cases per 100,000 people 19,20 . Between 2007 and 2011, the number of new cases of all cancers was 460.4 per 100,000 people annually, and the lifetime incidence of cancer is ~40.4% 13 . Taking the midpoint of the reported range as the annual incidence, (0.07+0.2)/2 = 0.135 per 100,000, the lifetime incidence of ADC is about 0.135/460.4·40.4% ≈ 0.012%.

Bladder cancer
A recent study sequenced 130 bladder tumors with matched normal samples via WES, and revealed a mutation frequency of ~7.7 per Mb 21 .
The lifetime incidence of bladder cancer is 2.4% 13 . The lifetime incidence of breast cancer for women is ~12.3% (www.seer.cancer.gov) 13 .

Cholangiocarcinoma (CCA)
WES performed on 40 CCAs collected from two recent studies revealed an average of 36.8 somatic mutations per sample 25,26 , corresponding to a mutation frequency of ~1.2 per Mb.
The annual incidence of all cancers was 460.4 per 100,000, corresponding to a lifetime cancer incidence of ~40.4% 13 . Given that the annual incidence of CCA was estimated to be 1.67 per 100,000 in US 27 , the lifetime incidence of CCA is approximately 1.67/460.4·40.4% ≈ 0.15%.

Chromophobe renal cell carcinoma (chRCC)
TCGA group sequenced 66 chRCCs with matched normal samples via WES and revealed a mutation frequency of ~0.4 per Mb 28 .

Chronic lymphocytic leukemia (CLL)
A recent study analyzed 105 cases of CLL using WES, and reported that the somatic mutation frequency of CLL is ~0.9 mutations per Mb 32 . This mutation frequency is slightly higher than that of an earlier study of 91 CLL cases, which reported a somatic mutation frequency of 0.72 ± 0.36 per Mb 33 . Therefore, weighting the two studies by sample size, we estimated the mutation frequency of CLL to be (105·0.9+91·0.72)/196 ≈ 0.8 per Mb.
The lifetime incidence rate of CLL is 0.52% 3 .
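The pooled CLL mutation frequency above is a sample-size-weighted mean of the two studies. A minimal sketch of that calculation is given below; the helper name is hypothetical and the inputs are the values quoted in the text.

```python
def pooled_frequency(studies):
    """Sample-size-weighted mean of mutation frequencies.

    studies: list of (n_samples, mutations_per_mb) tuples.
    """
    total_samples = sum(n for n, _ in studies)
    return sum(n * freq for n, freq in studies) / total_samples


# Two CLL cohorts: 105 cases at ~0.9 per Mb and 91 cases at 0.72 per Mb
cll = pooled_frequency([(105, 0.9), (91, 0.72)])
print(round(cll, 2))  # 0.82, reported as ~0.8 per Mb
```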

Clear cell (conventional) renal cell carcinoma (ccRCC)
The TCGA group sequenced 417 ccRCCs with matched normal samples via WES and revealed a mutation frequency of ~1.1 per Mb 5 .
The lifetime risk of kidney and renal pelvis cancer is ~1.6% (www.seer.cancer.gov) 13 . About 90% of kidney and renal pelvis cancers are renal cell carcinomas 29,30 . ccRCC is the most common subtype of renal cell carcinoma, representing ~70% of cases 31 . Therefore, we estimate the lifetime incidence of ccRCC to be 1.6%·90%·70% ≈ 1.01%.

Cutaneous squamous cell carcinoma (CSCC) and Basal cell carcinoma (BCC)
Nonmelanoma skin cancer is the most common human malignancy 34-36 . A previous study analyzed eight primary cutaneous squamous cell carcinomas (CSCCs) matched with normal tissue using WES, and revealed a somatic mutation frequency of ~39 per Mb 37 , making it the most highly mutated malignancy among the cancers known at the time. Recently, a study characterized the mutational landscape of 12 sporadic BCCs using WES, and found a mutation frequency of 75.8 per Mb of coding DNA, about twice that of CSCC 38 .
BCC is the most common form of skin cancer, with a lifetime incidence estimated to be ~30% 3 . The lifetime incidence of CSCC is estimated to be one fourth of BCC 39 , and thus is ~7.5%.

Diffuse large B-cell lymphoma (DLBCL)
WES performed on 49 DLBCLs revealed a mutation frequency of ~3.2 per Mb 40 .
DLBCL is the most common type of non-Hodgkin lymphoma. The annual incidence of non-Hodgkin lymphoma is 19.7 per 100,000 people (www.seer.cancer.gov) 13 , and DLBCL accounts for 36% (~7 per 100,000 people) of these cases 41 . The lifetime incidence of non-Hodgkin lymphoma is 2.1% 13 . Thus, we estimate the lifetime incidence of DLBCL is 2.1%·36% ≈ 0.76%.

Endometrial carcinoma (EDC)
A recent study performed WES on 14 endometrial tumors with matched normal samples, and revealed the somatic mutation frequency of 3.7 mutations per Mb 42 .
The lifetime incidence of GBM was estimated to be ~0.219% 3 .

Head and Neck squamous cell carcinoma (HNSCC)
A study on 74 HNSCC tumor-normal pairs revealed that the mutation frequency of HPV-positive tumors (n = 11) and HPV-negative tumors (n = 63) is ~2.28 per Mb and ~4.83 per Mb, respectively 47 . The difference on mutation frequency between HPV-positive and
The lifetime incidence was estimated to be 7.935% for HNSCC infected with HPV-16, and 1.38% for HNSCC not infected 3 .

Hepatocellular carcinoma (HCC)
A recent study performing WES on 231 HCCs revealed that the mutation frequency of HCCs infected with hepatitis B virus (HBV) (n = 167) is ~2.0 per Mb, lower than that of non-HBV-related HCCs (n = 64, ~3.4 mutations per Mb) 50 . By dividing the non-HBV-related HCCs into HCCs infected with hepatitis C virus (HCV, n = 22) and neither-HBV-nor-HCV HCCs (NBNC, n = 42), we found the mutation frequency to be ~3.34 per Mb for HCV-related HCCs and ~3.51 per Mb for NBNC HCCs. These mutation frequencies are similar to those of a previous study performing WGS on 27 HCCs 51 .
The lifetime incidence is ~0.71% for HCC not infected with HCV and ~7.1% for HCC with HCV infection 3 .

Hereditary Non-polyposis Colorectal cancer (HNPCC)
HNPCC, also known as Lynch syndrome, is an inherited cancer caused by a defect in DNA mismatch repair (MMR) 52-54 . Patients with HNPCC present a microsatellite instability (MSI) phenotype. According to a previous study of 15 MSI colorectal cancers using WES 55 , the mutation frequency of MSI colorectal cancer has been estimated to be ~47 per Mb, which resembles the high mutation frequency estimated for MMR-deficient cancers 56 .
The lifetime incidence of colorectal cancer for people with HNPCC genes has been estimated to be ~50% 3 .

Lung adenocarcinoma (LUAC)
A recent study analyzed 183 LUAC tumor-normal pairs, and revealed a mean exonic somatic mutation frequency of ~12.9 per Mb from smokers (n = 135) and ~2.9 per Mb from lifetime nonsmokers (n = 27) 57 , being consistent with the results of previous studies 58, 59 . Last year, TCGA group revealed a mutation frequency of 8.87 per Mb by analyzing 230 LUACs 2 . By separating the smokers and non-smokers of this cohort, we found the mutation frequency for smokers and nonsmokers to be ~10 per Mb (n = 137) and ~2.8 per Mb (n = 24), respectively.
Because of the similar sample size, we estimate the mutation frequency to be the average of the two previous studies: 11.5 per Mb and 2.85 per Mb for smokers and lifetime nonsmokers, respectively.
The lifetime incidences of LUAC for smokers and never smokers were estimated to be 8.1% and 0.45%, respectively 3 .

Lung squamous cell carcinoma (LSCC)
The TCGA group sequenced 178 LSCCs with matched normal samples 60 . The mutation frequency of LSCCs is ~10.5 per Mb for smokers (n = 169). The mutation frequency for nonsmokers was not used because there were too few samples (n = 6).
The incidence of LSCC is about 75% of the incidence of LUAC in the population 61 . Therefore, we estimate the lifetime incidence of LSCC for smokers to be 8.1%·75% ≈ 6.1%.

Medulloblastoma
Mutation events are less frequent in medulloblastoma than in most other solid tumors 11,62 . By analyzing 92 primary medulloblastoma-normal pairs using WES, a study revealed that the somatic mutation frequency of medulloblastoma is ~0.47 per Mb 11 . A subsequent study revealed a mutation frequency of ~0.52 per Mb based on 39 tumor-normal pairs using WGS 63 .
The lifetime incidence of medulloblastoma has been estimated to be 0.011% 3 .

Neuroblastoma (NBM)
WES performed on 81 NBM tumors revealed a mutation frequency of ~0.7 per Mb 75 .
The number of new cases of all cancers was 460.4 per 100,000 annually, and the lifetime incidence of cancer is ~40.4% 13 . The annual incidence rate of NBM was 7.7 cases per million over the last three decades 76 . Thus, we estimate that the lifetime incidence of NBM is 7.7/4604·40.4% ≈ 0.068%.

Non-papillary Gallbladder adenocarcinoma (GBA)
A recent study identified the somatic mutations for 57 tumor-normal pairs of GBA using WES, and revealed a ~1.42 per Mb mutation frequency 77 .
The lifetime incidence of non-papillary GBA has been estimated to be ~0.28% 3 .

Ovarian cancer
TCGA group sequenced 394 ovarian tumors with matched normal samples via WES, and revealed a mutation frequency of ~2.08 per Mb 78 .

Pancreatic ductal Adenocarcinoma (PDAC)
A recent study performed WES on 15 PDACs with matched normal samples, and revealed the mutation frequency to be ~2.7 per Mb 79 .
The lifetime incidence of PDAC has been estimated to be ~1.36% 3 .

Prostate cancer
A recent study integrated the sequencing data of 81 prostate tumors from TCGA project and 141 prostate tumors from previous studies, and revealed a mutation frequency of ~0.83 per Mb 44 .

Rhabdoid cancer (RHC)
WES performed on 32 primary RHC tumors with matched normal peripheral blood DNA revealed a mutation frequency of ~0.19 per Mb (range 0-0.45) 9 .
The annual incidence of cancer overall is 460.4 per 100,000, corresponding to a lifetime incidence of ~40.4% 13 . The average age-adjusted annual incidence of RHC is 0.07 per 100,000 people 80 . Thus, we estimate the lifetime incidence of RHC to be 0.07/460.4·40.4% ≈ 0.0061%.

Small cell lung cancer (SCLC)
A recent study performed WES on 29 SCLCs with matched normal sample, and revealed a

Small intestine neuroendocrine tumor (SINT)
A recent study detected 48 SINTs using WES and revealed that its somatic mutation frequency is very low, at an average ~0.1 per Mb in the exome 83 .
The annual incidence of cancer overall is 460.4 per 100,000, corresponding to a lifetime incidence of ~40.4% 13 . The average annual incidence of SINT is 0.85 per 100,000 people 84 .

Stomach Adenocarcinoma (STAD)
A The lifetime incidence of medullary THCA has been estimated to be 0.0324% 3 .

Mathematical modeling
Here we first introduce how the mutation rate is associated with the probability of the first rate-limiting event (driver gene mutation) and then show the consistency between this derivation and some important cancer behaviors.
Linear correlation between accumulated mutation rate and the probability of the first rate-limiting step. Assume that in a normal tissue the mutation rate of the genome is constant in time, and let $\mu$ represent the mutation rate per unit interval of time before the first rate-limiting step. Then after time $t$, the cell genome has accumulated on average $\mu t$ mutations per unit base pair length (accumulated mutation frequency), and the probability $P_0$ that an initiating driver gene mutates (the rate-limiting step) is determined by $\mu t$ and the base pair length, $L$, of this gene:
$$P_0 = 1 - \left(1 - \frac{L}{G}\right)^{\mu t G},$$
where $G$ is the length of the genome and $(1 - L/G)^{\mu t G}$ represents the probability that this gene remains intact after the genome has accumulated $\mu t G$ mutations.
Given that $L \ll G$, $P_0$ can be modeled by an exponential function, $P_0 = 1 - e^{-\mu t L}$, which is approximately equal to $\mu t L$ assuming that $\mu t L$ is smaller than 1 by many orders of magnitude. On a logarithmic scale,
$$\log P_0 = \log \mu t + \log L.$$
This indicates that the probability of the first rate-limiting step is determined by the accumulated mutation frequency $\mu t$ and the length of the driver gene. This is consistent with our assumption that the accumulated "consensus" mutations in bulk sequencing (determined by $\mu t$) represent the probability of the first rate-limiting mutation to initiate the preneoplastic growth.
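The chain of approximations in this paragraph (exact survival probability, exponential form, linear form) can be checked numerically. The sketch below is illustrative only: the genome length, gene length and accumulated mutation frequency are hypothetical round numbers chosen for the example, not values taken from our data, and the function names are our own.

```python
import math


def p_exact(mu_t, L, G):
    """1 - (1 - L/G)**(mu_t * G), computed stably with log1p/expm1."""
    return -math.expm1(mu_t * G * math.log1p(-L / G))


def p_exponential(mu_t, L):
    """Exponential form 1 - exp(-mu_t * L), valid when L << G."""
    return -math.expm1(-mu_t * L)


def p_linear(mu_t, L):
    """Linear form mu_t * L, valid when mu_t * L << 1."""
    return mu_t * L


# Hypothetical inputs: 3e9 bp genome, 3 kb driver gene,
# accumulated frequency of 1 mutation per Mb (1e-6 per bp).
G, L, mu_t = 3e9, 3e3, 1e-6

for name, value in [("exact", p_exact(mu_t, L, G)),
                    ("exponential", p_exponential(mu_t, L)),
                    ("linear", p_linear(mu_t, L))]:
    print(f"{name:12s} {value:.6e}")
```

With these inputs the three forms agree closely, and the linear form differs from the exact one by well under 1%, which is the regime the derivation assumes.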
Modeling cancer incidence. One of the earliest theories of tumorigenesis that treated cancer as a stepwise progression was based on the observation of age-specific cancer incidence.
Explanations of age-specific cancer incidence, which date back to the work of Muller and Nordling half a century ago 95,96 , gave rise to the now widely held idea that tumor growth is initiated by a driver mutation, and constitute the basis of the classic Armitage-Doll model 97 . Now assume that, for a progenitor cell to evolve into a clinically meaningful tumor, $n$ ensuing independent driver mutations are also required. According to the Armitage-Doll model, the cancer incidence is given by
$$I_t = \frac{P_0 P_1 \cdots P_n}{n!}\, t^n.$$
In this model, cancer incidence ($I_t$) is determined by the probability of the initiating rate-limiting step ($P_0$) and ensuing steps ($P_1 \cdots P_n$) per unit time interval, and increases with a power of age ($t$) that reflects the number of ensuing steps ($n$) necessary to develop a clinically meaningful tumor. On a logarithmic scale, we obtain the widely known equation for age-dependent cancer incidence,
$$\log I_t = \log \frac{P_0 P_1 \cdots P_n}{n!} + n \log t.$$
This equation is widely used to explain the age-specific incidence of a specific cancer type in log-log coordinates, where $n$ is the slope of the incidence increase with age $t$. A large value of $n$ means a relatively higher cancer risk at old ages than at young ages.
However, in our study, we focused on the lifetime incidence of cancer and disregarded the age-specific behavior of cancer incidence (that is, we assumed a constant $t$ equal to the lifetime). As we have assumed, the mutation frequency in our measurements should correlate with the frequency of mutation accumulation before the preneoplastic growth, as well as with the probability of the first rate-limiting mutation. The correlation between incidence and accumulated mutation frequency can be given by
$$\log I_t = \log \mu t L + C, \qquad C = \log \frac{P_1 \cdots P_n\, t^n}{n!}.$$
Here, $\log P_0 = \log \mu t L$ is a reasonable explanation of why the slope of the regression line between incidence and accumulated mutation frequency is approximately 1, and $C$ includes competing risks that can cause each data point to deviate from a straight line. Thus, the strong correlation between mutation frequency and cancer incidence suggests that the first rate-limiting step has outcompeted other competing factors and determines the majority of cancer incidence.
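To illustrate why proportionality between incidence and accumulated mutation frequency yields a slope of about 1 in log-log coordinates, the following sketch fits a least-squares line to synthetic values. The frequencies and the proportionality constant are made up for the illustration and are not our measurements; the function name is hypothetical.

```python
import math


def loglog_fit(x, y):
    """Ordinary least-squares fit of log10(y) against log10(x).

    Returns (slope, intercept).
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic mutation frequencies (per Mb) and a made-up constant that
# stands in for the contribution of the ensuing steps.
freq = [0.1, 0.5, 1.0, 5.0, 50.0]
const = 0.004
incidence = [const * f for f in freq]

slope, intercept = loglog_fit(freq, incidence)
print(slope, intercept)  # slope close to 1, intercept close to log10(const)
```

If incidence instead scaled with a higher power of the mutation frequency, for example its square, the same fit would return a slope close to that power rather than 1.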

Consistency between the modeling and cancer behaviors. Our conclusion is based on the reasonable assumption that the accumulated "consensus" mutations in bulk sequencing ($\mu t$) represent the mutations of the ancestral cell and thus correlate with the probability of the first rate-limiting mutation, whereas ensuing steps during clonal evolution act as competing risks that are relatively independent of $\mu t$. To further investigate this, we show the consistency of this assumption with the overall behaviors of cancer.
We first consider the extreme cases in which a disastrous accident is likely to have caused the initiating mutation in many of those affected. The incidence of a given cancer type in these individuals at age $t$, assuming independent ensuing steps, would be given by
$$I'_t = \frac{(P_0 + \alpha)\, P_1 \cdots P_n}{n!}\, t^n,$$
where $\alpha$ represents the percentage of survivors with initiating mutations caused by the accident. Then the excess relative risk is
$$\frac{I'_t}{I_t} - 1 = \frac{\alpha}{P_0} = \frac{\alpha}{\mu t L},$$
which decreases with $t$. Consistent with this model, an excess relative risk of cancer that follows a decreasing power function of time has indeed been found in members of the Life Span Study (LSS) cohort of Hiroshima and Nagasaki atomic bomb survivors 98,99 , and in other cohorts exposed to ionizing radiation 100 . On the contrary, if the mutations accumulated through radiation were responsible for all the rate-limiting steps with the same effect, a similar behavior of age-specific incidence would be expected and the excess relative risk would be time independent.
We next consider tobacco smoking, which can increase the probability of the first rate-limiting mutation and thereby increase cancer incidence. Assume that smoking adds an additional risk factor, $\beta$, to cause the initiating mutation. Then according to our modeling, the incidence of a given cancer type in smokers would be given by:
$$I^{s}_t = \frac{(\mu + \beta K)\, t L\, P_1 \cdots P_n}{n!}\, t^n,$$
where $K$ is the intensity of smoking (cigarettes per day). Then the relative risk,
$$\frac{I^{s}_t}{I_t} = 1 + \frac{\beta}{\mu} K,$$
is a linear function of $K$. This is indeed consistent with the overall behavior observed for the relative risk of lifetime lung cancer incidence vs. smoking intensity 101,102 . On the contrary, if smoking had an equivalent effect on all the rate-limiting steps, such a multistep process would impose a power-law function of $K$,
$$\frac{I^{s}_t}{I_t} = \left(1 + \frac{\beta}{\mu} K\right)^{n+1}.$$
It is noteworthy that, by assuming that accumulated mutation reflects the probability of the first rate-limiting mutation, our model is robust enough to include other hypotheses, such as allowing m to have a distribution. It is also possible to include in the model the mutation accumulation from temporal exposure to environmental mutagens. For example, cancer incidence following exposure to a dose $D$ of mutagen at some point during lifetime can be given by:
$$I^{D}_t = \frac{\left(\mu t + f(D)\right) L\, P_1 \cdots P_n}{n!}\, t^n$$
and
$$\frac{I^{D}_t}{I_t} = 1 + \frac{f(D)}{\mu t},$$
where $f(D)$ is the function determining the dose-dependent effect of mutation induction 103 .
This predicts a decreasing power function of $t$ for the relative risk after temporal exposure to mutagens. In fact, this prediction has been observed in studies of lung cancer relative risk after cessation of smoking 101,104 . Thus, our model is consistent with these data.
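The qualitative predictions in this section (relative risk linear in smoking intensity when only the initiating step is affected, a steeper curve when all steps are affected, and a relative risk that decays with time after a single exposure) can be sketched numerically. The functional forms below are our simplifying assumptions for illustration: the initiating probability is taken as proportional to the accumulated mutation frequency, smoking is assumed to add a term proportional to intensity to the mutation rate, and a single exposure is assumed to add a fixed number of mutations. All parameter names and values are hypothetical.

```python
def rr_first_step_only(K, beta, mu):
    """Relative risk when smoking intensity K raises only the initiating
    mutation rate from mu to mu + beta*K (assumed form): linear in K."""
    return 1.0 + beta * K / mu


def rr_all_steps(K, beta, mu, n_steps):
    """Relative risk if the same multiplicative effect applied to every
    one of n_steps rate-limiting steps (assumed form)."""
    return (1.0 + beta * K / mu) ** n_steps


def rr_single_exposure(t, extra_mutations, mu):
    """Relative risk at time t after a one-off exposure that adds a fixed
    mutation frequency on top of the accumulated mu*t (assumed form)."""
    return 1.0 + extra_mutations / (mu * t)


# Hypothetical parameters, chosen only to show the shapes.
beta, mu = 0.5, 1.0
for K in (0, 10, 20):
    print(K, rr_first_step_only(K, beta, mu), rr_all_steps(K, beta, mu, 3))

for t in (10, 20, 40):
    print(t, rr_single_exposure(t, extra_mutations=2.0, mu=mu))
```

Under these assumptions, equal increments in K produce equal increments in relative risk in the first case only, and the excess relative risk after a single exposure falls off as 1/t.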
Discussion. It is easy to see from the above modeling that our analyses do not require the probability of the first rate-limiting step to be exactly equal to the "consensus" mutation frequency in the tumor bulk, only to be proportional to it. Although the ensuing rate-limiting steps that determine the latent period and age-specific behavior of cancer are included in the analyses, they are treated as competing factors in our modeling, which is reasonable given that considering lifetime incidence disregards the latent period and age-specific behavior of cancer risk. Our modeling is deliberately oversimplified compared with the true complexity of tumorigenesis. However, simple models, such as the Armitage-Doll model, have proven very useful in providing novel insights into tumorigenesis. Overall, despite the simplification, our modeling is surprisingly consistent with some important behaviors of cancer.
Table S1. Numbers of stem cell divisions for different tissues are obtained from the previous study [9]. A. Correlation between stem cell division, mutation frequency and lifetime risk in 2-D space. As a reference, lifetime risk of cancer (denoted as red nodes) is also plotted against the number of stem cell divisions, with y-axis labeling on the right. A line connecting the mutation frequency and the lifetime risk of the same cancer is plotted as a reference for the inconsistency of the two variables when predicted by stem cell division. Inset shows the Pearson correlation coefficient between variables in the original scale and log-log scale. B. Correlation between stem cell division, mutation frequency and lifetime risk in 3-D space.