The molecular basis of cancer is being unraveled by extensive sequencing efforts of cancer genomes such as those being undertaken by The Cancer Genome Atlas (TCGA). Revealing alterations in cancer genomes has led to a more comprehensive understanding of cancer biology, exposed therapeutic targets, and enhanced the classification of cancer types beyond phenotypes. However, the abundance of cancer mutations has made the identification of bona fide driver genes a nontrivial task, and it remains unclear whether we are close to identifying all driver genes for each cancer [1].

Given that the number of driver genes is bounded by the number of human protein-coding genes, it is expected that nearly all clinically significant driver genes will eventually be discovered by sequencing enough cancers, beyond which additional efforts would be excessive. In an earlier analysis, Lawrence et al. estimated that between 650 and 5300 samples for each cancer type would be needed to provide sufficient power to detect genes mutated in at least two percent of cancers [2]. In fact, their down-sampling tests showed that the number of cancer driver genes across cancer types grew nearly linearly with the number of samples sequenced. At the time, their data set reported the identification of up to 254 driver genes based on an analysis of 4742 samples across 21 cancer types available up to 2014. However, the 2018 TCGA analysis of 9423 samples across 33 cancer types revealed only 299 driver genes, falling short of the expected 50% increase in driver genes. This suggests that we may already be approaching saturation in driver gene discovery with far fewer specimens analyzed than previously thought to be needed [3].
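The comparison between the two data sets reduces to simple arithmetic on the counts quoted above. The following is a minimal sketch in Python that uses only the sample and driver gene numbers reported in the text; the helper name `relative_increase` is ours and is introduced purely for illustration.

```python
def relative_increase(old, new):
    """Fractional increase from old to new (0.5 means +50%)."""
    return (new - old) / old


# Counts quoted in the text for the 2014 and 2018 analyses.
samples_2014, samples_2018 = 4742, 9423
drivers_2014, drivers_2018 = 254, 299

sample_growth = relative_increase(samples_2014, samples_2018)
driver_growth = relative_increase(drivers_2014, drivers_2018)

print(f"samples sequenced: +{sample_growth:.1%}")
print(f"driver genes:      +{driver_growth:.1%}")
```

On these figures, the number of samples roughly doubled while the number of driver genes rose by less than 20%.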

To assess the relationship between the intensity of sequencing efforts and the number of driver genes identified, we plotted the number of samples sequenced by TCGA for different cancers against the number of driver genes identified from those samples by a consensus of at least three driver gene identification algorithms (Fig. 1). Driver gene algorithms are listed in Table 1 and discussed in detail by Cheng et al. [4]. There was a statistically significant positive correlation across 33 cancer types (p = 0.001). A correlation coefficient of 0.53 indicated that 28% of the variance in the number of driver genes between different cancer types is potentially explained by the number of samples sequenced. While there is no biological mechanism by which the number of driver genes of different cancers should be related to sample size, a correlation is expected while the number of driver genes is still increasing with additional sampling. This association would then be expected to become non-significant as the variance in driver genes explained by sampling diminishes with oversaturation. Thus, the variance in the number of driver genes for different cancer types that is explained by the number of samples may be informative of the degree of saturation in sequencing. Supporting this notion, the correlation between sample sizes and driver genes in the TCGA data available in 2014, which comprised fewer samples, was characterized by a correlation coefficient of 0.67, suggesting that 45% of the variance in the number of driver genes between different cancer types was due to sample sizes at the time.
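The explained variance quoted above is the square of the correlation coefficient. The sketch below, in Python, shows that step for the two coefficients reported in the text, together with a plain Pearson correlation function. We assume here that the reported coefficients are Pearson correlations; the function names are ours and do not correspond to any published analysis code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def explained_variance(r):
    """Share of variance explained by a simple linear association."""
    return r * r


# Correlation coefficients reported in the text.
for label, r in (("2014 data", 0.67), ("2018 data", 0.53)):
    print(f"{label}: r = {r:.2f}, r^2 = {explained_variance(r):.2f}")
```

Squaring 0.67 and 0.53 gives approximately 0.45 and 0.28, matching the 45% and 28% stated above.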

Fig. 1 Sample size and the number of cancer driver genes identified for 33 cancer types. The grey line depicts a power law defined by a constant of 4.976 and an exponent of 0.56.

Table 1 Driver gene identification algorithms

Notably, the relationship between sample size and the number of cancer driver genes followed a power law with a scaling exponent of 0.561. Similar results were obtained when the number of cancer driver genes was determined by at least two to four published computational algorithms widely used to annotate likely functional cancer mutations [4, 5]. A scaling exponent of less than one indicates that, as the number of samples sequenced increases across different cancer types, the number of driver genes identified grows more slowly than the number of samples. This implies that sequencing efforts are already encountering decreasing rates of return in terms of discovering new driver genes. Consistent with this, the power law relationship between sample sizes and driver genes in the 2014 TCGA data had a scaling exponent of 1.09. This supralinear scaling indicates that early sequencing efforts were associated with increasing rates of return in the number of driver genes discovered. Although further genomic investigation of uncharacterized cancers, and perhaps even of some prevalent cancers, would be fruitful, the decline in both the scaling exponent and the explained variance with an increasing number of cancers sequenced supports the hypothesis that sequencing efforts are approaching saturation.
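To make the meaning of the scaling exponent concrete, the following Python sketch evaluates a power law of the form D = c * N^k using the constant (4.976) and exponent (0.561) reported for Fig. 1. The function names and the example sample sizes are ours. Note also that the fit was made across cancer types, so reading it as a growth curve within a single cancer type is an assumption made here for illustration only.

```python
def predicted_driver_genes(n_samples, constant=4.976, exponent=0.561):
    """Driver genes predicted by the power law D = c * N**k."""
    return constant * n_samples ** exponent


def marginal_genes_per_sample(n_samples, constant=4.976, exponent=0.561):
    """Derivative dD/dN = c * k * N**(k - 1): genes gained per extra sample."""
    return constant * exponent * n_samples ** (exponent - 1)


def doubling_factor(exponent):
    """Factor by which D grows when N doubles, independent of c and N."""
    return 2 ** exponent


# Illustrative sample sizes (hypothetical, not taken from TCGA).
for n in (100, 500, 1000):
    print(
        f"N = {n:>4}: predicted genes = {predicted_driver_genes(n):.1f}, "
        f"marginal gain = {marginal_genes_per_sample(n):.3f} per sample"
    )

# Exponents reported in the text for the 2018 and 2014 data.
for label, k in (("2018 data", 0.561), ("2014 data", 1.09)):
    print(f"{label}: doubling the samples multiplies driver genes by {doubling_factor(k):.2f}")
```

With an exponent below one, the marginal gain per additional sample falls as N grows and doubling the sample size less than doubles the predicted number of driver genes. With an exponent above one, as in the 2014 data, doubling the sample size more than doubles it.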

Comparing rates of cancer driver gene discovery between different cancers may help to determine which cancer types would benefit most from greater sample sizes. Cancer types lying above the plotted best fit line have a relatively faster driver gene discovery rate, implying that current sequencing efforts for them are likely not close to saturation. This must be interpreted cautiously, as different cancer types are not expected to have the same number or distribution of driver genes, which would affect the rate of discovery. Nonetheless, our plot suggests that outliers including diffuse large B cell lymphoma, lung adenocarcinoma, lung squamous cell carcinoma, and adrenocortical carcinoma may benefit from additional sequencing of specimens, which is congruent with prior power analyses and the greater tumor mutation burden of these cancers [2, 6].
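Identifying cancer types that lie above the best fit line amounts to computing residuals from the fitted power law. The Python sketch below fits the power law by ordinary least squares on log-transformed values and lists the labels with positive residuals. The fitting procedure is an assumption on our part, since the method used for Fig. 1 is not specified here, and the data are hypothetical placeholders rather than TCGA values.

```python
import math


def fit_power_law(samples, genes):
    """Least-squares fit of log(genes) = log(c) + k * log(samples).

    Returns (c, k). All inputs must be strictly positive.
    """
    xs = [math.log(n) for n in samples]
    ys = [math.log(g) for g in genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    exponent = sxy / sxx
    constant = math.exp(mean_y - exponent * mean_x)
    return constant, exponent


def log_residuals(data, constant, exponent):
    """Map each label to log(observed genes) - log(predicted genes)."""
    return {
        label: math.log(g) - math.log(constant * n ** exponent)
        for label, (n, g) in data.items()
    }


# Hypothetical (samples, driver genes) pairs; NOT real TCGA values.
toy = {
    "type_A": (50, 12),
    "type_B": (200, 20),
    "type_C": (400, 55),
    "type_D": (1000, 60),
}

constant, exponent = fit_power_law(
    [n for n, _ in toy.values()], [g for _, g in toy.values()]
)
residuals = log_residuals(toy, constant, exponent)

# Labels above the fitted line, largest positive residual first.
above_line = sorted(
    (label for label, r in residuals.items() if r > 0),
    key=residuals.get,
    reverse=True,
)
print(f"fitted constant = {constant:.3f}, exponent = {exponent:.3f}")
print("above the fitted line:", above_line)
```

A positive residual only indicates that more driver genes were found than the fit predicts for that sample size. As noted above, it does not by itself distinguish unsaturated sequencing from a cancer type that simply has more driver genes.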

It is undeniable that large scale genomic projects are valuable, but they require very significant investments of money, time, and expertise. The opportunity costs of such investments will likely remain underestimated, as research policies and funding in the past decade have been driven by the prevailing belief that more data is better [7, 8]. Even if the cost of genome sequencing is no longer prohibitive, the deluge of data from sequencing efforts is expected to mushroom, requiring additional time, hardware, and expertise to analyze and store. Our results argue that past strategies of indiscriminately sequencing as many specimens as possible for all cancer types are inefficient. In this time of austerity in research funding, we should be evaluating whether advancements from further sequencing efforts are truly innovative or merely incremental. In addition, unless it can be demonstrated that cancer genomics can alter or improve clinical practice, we run the risk of sequencing for the sake of sequencing rather than for meaningful patient benefit. Thus, it is essential that future research assess how genomic data can be broadly applied, for example in stratifying patients to targeted therapies, particularly as the sparse efforts to do so in large clinical studies thus far have resulted in ambiguous conclusions [9].