Comparing the current short-term cancer incidence prediction models in Brazil with state-of-the-art time-series models

Bouzon Nagem Assad, Daniel; Gomes Ferreira da Costa, Patricia; Spiegel, Thaís; Cara, Javier; Ortega-Mier, Miguel; Monteiro Scaff, Alfredo

doi:10.1038/s41598-024-55230-2


Article
Open access
Published: 25 February 2024


Daniel Bouzon Nagem Assad^1,2,
Patricia Gomes Ferreira da Costa¹,
Thaís Spiegel¹,
Javier Cara²,
Miguel Ortega-Mier² &
Alfredo Monteiro Scaff³

Scientific Reports volume 14, Article number: 4566 (2024) Cite this article



Abstract

The World Health Organization has highlighted that cancer was the second-highest cause of death in 2019. This research aims to present the current forecasting techniques found in the literature for predicting time-series cancer incidence and then compare their results with the current methodology adopted by the Instituto Nacional do Câncer (INCA) in Brazil. A set of univariate time-series approaches is proposed to aid decision-makers in monitoring and organizing cancer prevention and control actions. Additionally, this can guide oncological research towards more accurate estimates that align with the expected demand. Forecasting techniques were applied to real data from seven types of cancer in a Brazilian district. Each method was evaluated by comparing its fit with real data using the root mean square error, and we also assessed the quality of noise to identify biased models. Notably, three methods proposed in this research have never been applied to cancer prediction before. The data were collected from the INCA website, and the forecast methods were implemented using the R language. A literature review made it possible to draw comparisons with previous works worldwide, illustrating that cancer prediction is often focused on breast and lung cancers, typically utilizing a limited number of time-series models to find the best fit for each case. Additionally, in comparison to the current method applied in Brazil, it has been shown that employing more generalized forecast techniques can provide more reliable predictions. By evaluating the noise in the current method, this research showed that the existing prediction model is biased for two of the studied cancers. Comparing error results between the mentioned approaches and the current technique, it has been shown that the current method applied by INCA underperforms in six out of seven types of cancer tested. Moreover, this research identified that the current method can produce a biased prediction for two of the seven cancers evaluated. Therefore, it is suggested that the methods evaluated in this work should be integrated into the INCA cancer forecast methodology to provide reliable predictions for Brazilian healthcare professionals, decision-makers, and oncological researchers.

Theoretical background

A time series is a sequence of time-oriented observations related to forecasting or controlling a specific variable¹. The study of this topic originated in 1927, adopting a general approach to time series analysis². Nearly three decades later, new time series forecasting approaches began to emerge.

Initially, classical time series statistical models were proposed³. Subsequently, these models were refined to include exponential smoothing techniques^4,5 before evolving into auto-regressive moving average models⁶. Eventually, they progressed further to incorporate Machine Learning⁷ and State-Space models⁸.

In all instances, the predictability of future events is a central element, crucial for planning and processes related to Operations Management, among others, such as Marketing, Economics, and Demography¹. However, the predictability of an event or quantity depends on various factors, including an understanding of the influencing factors, data availability, future and past similarities, and the potential impact of forecasts on the predicted outcome⁹.

In the context of oncology studies, mortality and incidence projection methods have already been compared in Canada, using age-period-cohort (APC), auto-regressive time series, and state-space models for at least ten cancer types¹⁰.

APC and Bayesian APC, auto-regressive integrated moving average (ARIMA) time series, and simple linear models were also compared for five cancer types in Switzerland¹¹.

Using reported breast cancer cases in the Fijian population from 1995 to 2016, Chand et al.¹² attempted to apply an ARIMA model to provide a 12-month ahead prediction. However, faced with non-stationary data according to the Augmented Dickey-Fuller test, a linear regression model was chosen. The proposed model was compared with the Naive Forecast Method, showing that the linear regression model outperformed the Naive Forecast Method.

Also exploring the epidemiological characteristics of breast cancer, Lin et al.¹³ used Exponential Smoothing (ETS) and Autoregressive Integrated Moving Average (ARIMA) models to forecast breast cancer incidence in China.

Regarding palliative cancer care, two different long short-term memory (LSTM) models were proposed, aiming to forecast the patients’ next visit day and estimate the total patient demand 1 week ahead¹⁴. These models took into account the patients’ requirements, demographics, and service history profiles.

Alrobai and Jilani¹⁵ also applied LSTM to forecast the incidence of the three most prevalent cancers in Saudi Arabia. However, it’s crucial to note that cancer prevalence can significantly vary from one country to another.

In Malaysia, to deal with the continued annual growth in cancer incidence rates, particularly for female breast, colorectal, and lung cancer, Lazam et al.¹⁶ tested ARIMA and Exponential Smoothing (ETS) models. They intended to determine the best model for predicting the incidence rates of these types of cancer.

Tudor¹⁷ proposed alternative ways to forecast cancer incidence and mortality by connecting population web-search practices with health variables officially published by Romanian authorities. The applied models included ARIMA, the Exponential Smoothing State-Space Model with Box-Cox Transformation, ARMA Errors, Trend, and Seasonal Components, and a feed-forward neural network nonlinear autoregression model.

In this research, conducted in Brazil, we present a framework to evaluate previous works on cancer time-series prediction, dividing the time-series prediction models, according to Hyndman and Athanasopoulos⁹, into Classical Statistical models (CSM), State-Space models (SSM), and Machine Learning models (MLM) (Table 1). For this, only studies that make cancer predictions were considered.

Table 1 Forecasting model applied by cancer type.


After comparing ten previous works related to cancer incidence prediction (Table 1), we can conclude that:

1. Breast and lung cancer incidence predictions have garnered more attention in specialized literature and have been studied in 8 and 7 works, respectively; colorectal cancer has been studied in 5 works, while other cancer types have been studied in 4 works or fewer.
2. CSM, and particularly ARIMA, were the most used approaches.
3. Considering SSM and MLM, TBATS, NNETAR, and MLP were never covered in previous research.
4. We found no previous work in which all three classes of models were applied.

As will be presented in this paper, the third and fourth conclusions allow us to state that this work covers a gap in current cancer prediction. Thus, applying unseen methods (3rd) and the three classes of models (4th) to cancer prediction is an original contribution of this research.

Finally, the mentioned studies address the application of different forecasting methods in countries such as Canada, Switzerland, Fiji, China, Malaysia, and Romania. Their use in Brazil, for a larger sample of types of cancer and comparing them, seems like a complementary contribution.

Methods

Data collection

In this research, we analyze real cancer data from Brazil obtained from INCA. All time series used are presented in Fig. 1 and are also available in Table 2. The seven cancer types evaluated are: Breast cancer (ICD-10 C50), Colorectal Cancer (ICD-10 C18 to C21), Prostate cancer (ICD-10 C61), Lung cancer (ICD-10 C33 and C34), Cervical cancer (ICD-10 C53), Head and Neck Cancer (ICD-10 C00 to C10) and Childhood Cancer (ICD-10 C00 to C96).

Table 2 ICD-10 mortality rate per 100,000 inhabitants, adjusted to the world population, by cancer type.


The filters employed for each type of cancer can be found in Table 3. We gathered data on the mortality rates for Brazilian cancer from INCA’s website²¹. The population figures were obtained from the 2022 Brazilian census²².

Table 3 Filters and criteria used to retrieve cancer data by cancer type.


In Brazil, cancer incidence is not registered for all districts. So, INCA works with an approximate incidence inferred from the mortality rate, following the estimation methodologies of Black et al.²³, Ferlay et al.²⁴ and Ferlay et al.²⁵, which are based mainly on the I/M ratio.

These methodologies link the unknown adjusted incidence rate (IRa) to the known adjusted mortality rate (MRa) of a district through the equation $IRa = MRa*(I_R/M_0)$, where $I_R$ refers to the known incidence in districts geographically near the targeted unmeasured district and $M_0$ refers to the number of deaths in the same districts. The resulting $I_R/M_0$ ratios for the unmeasured district evaluated are presented in Table 4.

Table 4 I/M ratio by cancer type.


In Fig. 1 we present, for each type of cancer studied, the mortality rate per 100,000 inhabitants adjusted to the world population. Then, according to the estimation methodologies^23,24,25, the expected adjusted incidence rate IRa can be obtained by multiplying the adjusted mortality rate in Fig. 1 by the values presented in Table 4 for each cancer type.
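
The conversion from mortality to estimated incidence can be illustrated with a minimal Python sketch (the study itself used R; this is not the authors' code). The function name and the numbers below are hypothetical and are not taken from Table 2 or Table 4; only the relation $IRa = MRa*(I_R/M_0)$ comes from the text.

```python
def incidence_from_mortality(mortality_rates, im_ratio):
    """Estimate adjusted incidence rates as IRa = MRa * (I_R / M_0)."""
    return [round(rate * im_ratio, 2) for rate in mortality_rates]


# Hypothetical values for illustration only (not from Table 2 or Table 4).
example_mortality = [10.0, 12.5, 11.0]  # adjusted mortality per 100,000 inhabitants
example_im_ratio = 4.0                  # I_R / M_0 for one cancer type

example_incidence = incidence_from_mortality(example_mortality, example_im_ratio)
print(example_incidence)
```

The same multiplication is applied year by year, so the estimated incidence series keeps the shape of the mortality series and differs only in scale.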

The current short-term predictions in Brazil rely on the average of the 3 most recent years. This outcome serves as a reference for the Brazilian public health system over the next 3 years. In essence, the existing approach is a simple moving average (MA).
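
As a rough illustration of this baseline, the sketch below (Python, not the authors' R code) averages the last three observations and repeats that value over the forecast horizon. The function name and the sample series are hypothetical; treating the forecast as flat over the 3 years is our reading of the description above.

```python
def moving_average_forecast(series, window=3, horizon=3):
    """Forecast every future step as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the averaging window")
    mean_value = sum(series[-window:]) / window
    return [mean_value] * horizon


# Hypothetical adjusted incidence rates, for illustration only.
example_series = [60.0, 62.0, 61.0, 63.0, 65.0, 64.0]
example_forecast = moving_average_forecast(example_series)
print(example_forecast)
```

Because the forecast is constant, any trend in the series appears directly in the residuals, which is one reason such a baseline may show autocorrelated errors.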

Forecasting models applied

In this research we apply the univariate forecasting methods available in Hyndman and Khandakar²⁶, Petris²⁷ and Kourentzes²⁸. The models applied in the next sections are presented in Table 5. These models were implemented in the R²⁹ language (version 4.1.3), and the code used is available in the Supplementary Material (Forecasting code.R).

Table 5 Forecasting models applied in this research.


Building each model requires estimating many parameters; the main features of each model are presented below:

ETS: ETS is a class of models built on three component equations, level ($l_t$), trend ($b_t$) and season ($s_t$), that explain the original time series variable ($y_t$) we aim to forecast. In each model, a component may be non-significant, denoted None (N), or significant, in which case it describes $y_t$ through Additive (A), Additive Damped (Ad) or Multiplicative (M) features. These components can be combined in 18 different ways (Fig. 2). For more details see Hyndman and Athanasopoulos⁹.
ARIMA: ARIMA, or Seasonal ARIMA (SARIMA), is a class of models that combines autoregressive (AR) and moving average (MA) terms with differenced values. The AR part of ARIMA (p) indicates that the time series is regressed on its own past data. The MA part of ARIMA (q) indicates that the forecast error is a linear combination of past errors. The I part of ARIMA (d) refers to differencing of order d, applied to obtain a stationary time series to which the ARMA approach can be applied³⁰. SARIMA models differ from ARIMA models in that the same components also appear lagged by the length of the seasonal time window (frequency), denoted P, D and Q. For more details see Hyndman and Athanasopoulos⁹ and Kotu and Deshpande³⁰.
Kalman filter (KF): KF methods seek the smallest vector that summarizes the past of the system and best describes the state of a deterministic dynamic system³¹. The KF model is composed of a linear autoregressive state equation $x(t+1) = A*x(t) + W(t)$, where $W(t) \sim N(0,Q)$, and a measurement equation $y(t) = C*x(t) + V(t)$, where $V(t) \sim N(0,R)$ and $y(t) \in {\mathbb {R}}$. The random variables W(t) and V(t) are assumed to be independent of each other, and both must follow a normal distribution.
TBATS: The TBATS model combines a Trigonometric Seasonal (T) representation with an Exponential Smoothing Method, a Box-Cox Transformation and an ARMA model for the residuals (BATS). The equations of the TBATS model are presented below, where $\omega$ and $\phi$ are the Box-Cox and damping parameters respectively, $b$ is the long-run trend, an ARMA(p, q) process models the error $d_t$, $m_1$ to $m_J$ are the seasonal periods used, $k_1$ to $k_J$ are the corresponding numbers of Fourier terms, and $\lambda_j^{(i)} = 2\pi j/m_i$. In TBATS, the BATS seasonal recursion (sixth equation) is replaced by the trigonometric representation given in the last three equations. For more details see De Livera et al.³².
$$\begin{aligned} y_{t}^{(\omega )} &= \frac{y_{t}^{\omega }-1}{\omega }, \quad \omega \ne 0,\\ y_{t}^{(\omega )} &= \log {y_{t}}, \quad \omega = 0,\\ y_{t}^{(\omega )} &= l_{t-1}+\phi b_{t-1}+\sum _{i=1}^{J} s_{t-m_i}^{(i)} +d_t,\\ l_{t} &= l_{t-1}+\phi b_{t-1} +\alpha d_t,\\ b_{t} &= (1-\phi ) b +\phi b_{t-1}+\beta d_t,\\ s_{t}^{(i)} &= s_{t-m_i}^{(i)} +\gamma _i d_t,\\ d_{t} &= \sum _{i=1}^{p} \phi _i d_{t-i}+\sum _{i=1}^{q} \theta _i \epsilon _{t-i} +\epsilon _{t},\\ s_{t}^{(i)} &= \sum _{j=1}^{k_i} s_{j,t}^{(i)},\\ s_{j,t}^{(i)} &= s_{j,t-1}^{(i)}\cos {\lambda _j^{(i)}} + s_{j,t-1}^{*(i)}\sin {\lambda _j^{(i)}} + \gamma _1^{(i)} d_t,\\ s_{j,t}^{*(i)} &= -s_{j,t-1}^{(i)}\sin {\lambda _j^{(i)}} + s_{j,t-1}^{*(i)}\cos {\lambda _j^{(i)}} + \gamma _2^{(i)} d_t \end{aligned}$$
NNETAR: Neural Network Time Series Forecasts (NNETAR) is a class of feed-forward neural networks with a single hidden layer and lagged inputs. This model works with 2 (for non-seasonal time series) or 3 (for seasonal time series) parameters: the number of past observations used as inputs (p), the number of past observations lagged by the length of the seasonal time window used as inputs (P), and the number of neurons (k) in the single hidden layer. In this research, a total of 20 networks are fitted, each with random starting weights, and their outputs are averaged when computing forecasts. The network is trained for one-step forecasting, and multi-step forecasts are computed recursively. The k selected for each type of cancer is half of the number of input nodes plus 1. For non-seasonal data, the fitted model is denoted as an NNAR (p, k) (Neural Network Autoregressive) model, which is analogous to an AR (p) model but with nonlinear functions. For seasonal data, the fitted model is called an NNAR (p, P, k)[m] model, which is analogous to an ARIMA (p, 0, 0)(P, 0, 0)[m] model but with nonlinear functions. For more details see Hyndman and Athanasopoulos⁹.
MLP: MLP is an extension of the feed-forward neural network in which an arbitrary number of hidden layers (the true computational engine of the MLP) are placed between the input and output layers. According to Kourentzes et al.³³, MLPs are designed to approximate any continuous function and can solve problems that are not linearly separable. In our time-series problem, the inputs (like p in the NNETAR model) are the most recent past observations, and we set the MLP model to choose the best number of input lags, between 1 and the prediction length (3 years), according to the Mean Square Error. The same criterion was adopted to choose the number of hidden nodes in each hidden layer. For more details see Kourentzes et al.³³.
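
To make the state-space idea more concrete, the following is a minimal scalar Kalman filter sketch in Python. It is not the authors' R implementation and does not reproduce the package they used. It assumes the standard form in which the next state equals A times the current state plus noise with variance Q, and the observation equals C times the state plus noise with variance R. All parameter values, the initial state and the sample observations are hypothetical.

```python
def kalman_filter_1d(observations, a=1.0, c=1.0, q=0.1, r=1.0, x0=0.0, p0=1.0):
    """Run a scalar Kalman filter; return filtered states, last state, last variance."""
    x, p = x0, p0
    filtered = []
    for y in observations:
        # Prediction step: propagate the state estimate and its variance.
        x_pred = a * x
        p_pred = a * p * a + q
        # Update step: correct the prediction with the new observation.
        innovation = y - c * x_pred
        innovation_var = c * p_pred * c + r
        gain = p_pred * c / innovation_var
        x = x_pred + gain * innovation
        p = (1.0 - gain * c) * p_pred
        filtered.append(x)
    return filtered, x, p


def kalman_forecast(x_last, a=1.0, c=1.0, horizon=3):
    """Project the last filtered state forward without new observations."""
    forecasts = []
    x = x_last
    for _ in range(horizon):
        x = a * x
        forecasts.append(c * x)
    return forecasts


# Hypothetical observations, for illustration only.
example_obs = [5.0] * 10
states, last_state, last_var = kalman_filter_1d(example_obs, x0=5.0)
print(kalman_forecast(last_state, horizon=3))
```

With a = c = 1 this reduces to a local level model, one of the simplest state-space specifications; in practice the variances Q and R would be estimated from the training data rather than fixed by hand.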

Forecasting models evaluation

The data presented in Table 2 were multiplied by the I/M ratio for each cancer type shown in Table 4 to estimate the incidence rate of each type of cancer evaluated (Fig. 3).

For instance, for breast cancer, the ICD-10 mortality rates per 100,000 inhabitants are 17.77 in 1979, 21.73 in 1980 and so on (second column of Table 2). Thus, the breast cancer adjusted incidence rates are these values multiplied by 5.59 (the breast cancer $I_R/M_0$ ratio in Table 4), which gives 77.65 in 1979, 94.96 in 1980, 69 in 1981 and so on, as can be seen in Fig. 3.

In this research, we are interested in providing a comparison between Brazil’s current short-term cancer prediction and state-of-the-art time-series models. As mentioned in Section Data collection, the current short-term cancer predictions are made 3 years ahead, so we split our dataset into training data (from 1979 to 2017) and test data (from 2018 to 2020).

Training (in-sample) and test (out-of-sample) data are evaluated using the Root Mean Square Error (RMSE) criterion. A low in-sample RMSE indicates a good average fit of the model used, while a low out-of-sample RMSE indicates that the model used, on average, delivers a reliable forecast⁹.
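
The split and the error measure can be sketched as follows in Python (the study used R; this is an illustrative sketch, not the authors' code). The helper names and the placeholder series are hypothetical; the year boundaries follow the text above.

```python
import math


def split_by_year(years, values, last_train_year=2017):
    """Split a yearly series into training (<= last_train_year) and test data."""
    train = [v for y, v in zip(years, values) if y <= last_train_year]
    test = [v for y, v in zip(years, values) if y > last_train_year]
    return train, test


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Placeholder series covering 1979 to 2020, for illustration only.
example_years = list(range(1979, 2021))
example_values = [float(i) for i in range(len(example_years))]
train_data, test_data = split_by_year(example_years, example_values)
print(len(train_data), len(test_data))
```

The in-sample RMSE is computed between the training data and the fitted values, and the out-of-sample RMSE between the three test observations and the three forecasts.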

Below we present the criteria adopted to evaluate the predictions of the current and proposed methods for each cancer type:

The noise evaluation over the training (in-sample) data according to the following tests: Student (ST), normality (NT), auto-correlation function (ACF) plot and Breusch-Pagan (BPT);
The error evaluation according to the test (out-of-sample) Root Mean Square Error (RMSE).

If the residuals have a zero mean in the Student test, follow a normal distribution in the Shapiro–Wilk test, remain within the interval defined by the blue lines in the ACF plot at all lags, and present constant variance over time (homoscedasticity) in the Breusch-Pagan test, we consider that the model residuals are white noise, which means that the model is unbiased^{34,35,36,37,38}.

The significance level adopted in this research is 0.05, which means that the residuals are considered white noise if the p-values obtained in each test are higher than 0.05 for each model.
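
The ACF part of this check can be sketched in Python as follows (illustrative only, not the authors' R code). It assumes the blue lines correspond to the usual approximate 95% bounds of ±1.96/√n drawn by default in R ACF plots. The function names are hypothetical, and the Student, Shapiro–Wilk and Breusch-Pagan tests are not covered here.

```python
import math


def autocorrelations(residuals, max_lag):
    """Sample autocorrelation of the residuals for lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    if denom == 0.0:
        # A constant residual series has no variation to correlate.
        return [0.0] * max_lag
    return [
        sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denom
        for lag in range(1, max_lag + 1)
    ]


def acf_bound(n, z=1.96):
    """Approximate 95% bound for the autocorrelation of white noise."""
    return z / math.sqrt(n)


def passes_acf_check(residuals, max_lag=10):
    """True when every autocorrelation up to max_lag stays inside the bounds."""
    bound = acf_bound(len(residuals))
    return all(abs(r) <= bound for r in autocorrelations(residuals, max_lag))


# Hypothetical residuals with a clear trend: they should fail the check.
trending_residuals = [float(i) for i in range(20)]
print(passes_acf_check(trending_residuals, max_lag=5))
```

Residuals that still carry a trend, as in the example, have large low-lag autocorrelations, which signals that the model has left systematic structure unexplained.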

Thus, in this research we consider that the best model for each cancer type is the one whose residuals (1) fulfill all the requirements presented above and that (2) obtains the lowest out-of-sample RMSE.
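
This two-step rule (exclude biased models, then pick the lowest out-of-sample RMSE) can be sketched in Python as below. The data structure, the model names and every number are hypothetical and serve only to show the logic; they are not results from this study.

```python
def select_best_model(results, alpha=0.05):
    """Return the eligible model with the lowest out-of-sample RMSE, or None."""
    eligible = {
        name: info
        for name, info in results.items()
        # Exclusion criteria: ACF check passed and every test p-value above alpha.
        if info["acf_ok"] and all(p > alpha for p in info["p_values"].values())
    }
    if not eligible:
        return None
    # Decision criterion: lowest out-of-sample RMSE among unbiased models.
    return min(eligible, key=lambda name: eligible[name]["rmse_out"])


# Hypothetical evaluation results, for illustration only.
example_results = {
    "Model A": {"p_values": {"student": 0.40, "normality": 0.30, "breusch_pagan": 0.50},
                "acf_ok": True, "rmse_out": 2.5},
    "Model B": {"p_values": {"student": 0.60, "normality": 0.01, "breusch_pagan": 0.45},
                "acf_ok": True, "rmse_out": 1.0},
    "Model C": {"p_values": {"student": 0.70, "normality": 0.20, "breusch_pagan": 0.35},
                "acf_ok": False, "rmse_out": 1.5},
    "Model D": {"p_values": {"student": 0.55, "normality": 0.25, "breusch_pagan": 0.65},
                "acf_ok": True, "rmse_out": 3.0},
}
best_model = select_best_model(example_results)
print(best_model)
```

In this example, Model B has the lowest RMSE but is excluded by the normality test, and Model C is excluded by the ACF check, so the rule can select a model with a higher error than a biased competitor.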

Results

In this section we apply the methods presented in columns of Table 5 to each type of cancer incidence presented in Fig. 3. In Table 6 we summarize the in sample and out of sample RMSE results by model and type of cancer.

Table 6 RMSE per type of cancer per model.


As mentioned in the Forecasting models evaluation section, to compare the model errors summarized in Table 6 we select the out-of-sample RMSE criterion. Then, to ensure that the model residuals are white noise in the training data, we apply the Student test (Table 7), the ACF plot, the Shapiro–Wilk normality test (Table 8) and the Breusch-Pagan test (Table 9).

Table 7 Student test p value per type of cancer per model.


Table 8 Normality test p value per type of cancer per model.


Table 9 Breusch-Pagan test p value per type of cancer per model.


As mentioned in Section Forecasting models evaluation, besides considering the RMSE criterion we must also evaluate whether each model produced residuals that are white noise, taking into account their auto-correlation plots and normality tests for all cancer types (Table 6).

This evaluation is presented for all types of cancer evaluated, grouped (Fig. 4) and individually: breast (Fig. 5), colorectal (Fig. 6), prostate (Fig. 7), lung (Fig. 8), cervical (Fig. 9), head and neck (Fig. 10) and childhood (Fig. 11).

The white noise failure evaluation by model and by cancer type is summarized in Table 10.

Table 10 White noise failure evaluation summary per type of cancer per model.


Considering the criteria presented in Section Forecasting models evaluation to ensure an unbiased model, we must select the best model for each type of cancer evaluated, discarding the results of the following failed (biased) models:

Breast cancer: the current model, ETS, ARIMA, TBATS and KF failed the auto-correlation function (ACF) plot evaluation (Fig. 5), and MLP failed the normality test.
Colorectal cancer: MLP failed the ACF plot evaluation (Fig. 6), the Breusch-Pagan test and the normality test; NNETAR also failed the Breusch-Pagan test.
Prostate cancer: NNETAR failed the ACF plot evaluation (Fig. 7), and MLP failed the normality test.
Lung cancer: KF failed the Student test and the ACF plot evaluation (Fig. 8), and MLP failed the normality test and the ACF plot evaluation.
Cervical cancer: only the current model produced residuals with a significant ACF plot (Fig. 9), and MLP failed the normality test.
Head and neck cancer: ARIMA failed the ACF plot evaluation (Fig. 10), and KF and MLP failed the normality test.
Childhood cancer: MLP failed the normality test.

Thus, the best models for each cancer type are: NNETAR for breast, KF for colorectal, ARIMA for prostate, TBATS for lung, KF for cervical, the current method for head and neck, and KF for childhood.

Their prediction plots can be seen respectively in Figs. 12, 13, 14, 15, 16, 17 and 18. The 3-year ahead prediction values are summarized in Table 11.

Table 11 Three years IRa prediction using the best model to each cancer type.


Discussion

A limitation of this research could be observed in the method used to obtain the incidence of cancer in Brazil. This occurs because, in practice, the incidence is not measured. Thus, we used cancer incidence estimation methodologies proposed in Black et al.²³, Ferlay et al.²⁴ and Ferlay et al.²⁵ which are based on the mortality rate discussed in Section Data collection.

Assuming that the incidence estimation methodologies above give us the best cancer incidence estimation, and evaluating only univariate time-series models, our findings in Table 6 seem to indicate that the current model applied by INCA in Brazil to forecast cancer incidence underperforms in 6 of the 7 types of cancer considered in this research. So, the forecasting methods presented seem to behave more adequately than Brazil’s current methodology.

It is important to note that we are working with the same type and amount of data that is used today, meaning that it would not be necessary to collect new variables in order to increase the accuracy of the forecast.

In addition, we did not see the CSM models outperform the others in any type of cancer, although ARIMA models (CSM) are the most widely used models in the current literature, as presented in Table 1.

These facts imply that, while there are no broad and reliable Population-Based Cancer Registries in the country, all research that uses these data as a primary source will be limited, including this one.

However, it is necessary to consider that Brazil has continental dimensions and a technological lag that do not facilitate the implementation of this type of registry. Although restrictive, this fact has not prevented research and public policies aimed at cancer prevention and control in the country, which surely could be more effective.

In this sense, we reinforce that the aim is not to invalidate what has been done in the country, but to plead for the opening of space so that new, more accurate forecast models can be adopted, aiming at supporting strategic decisions to face cancer in the country. This is especially so because the current literature (Table 1) has used models that go in the opposite direction of the results presented by this research.

For instance, MLM models were only used in the works of Soltani et al.¹⁴ and Alrobai and Jilani¹⁵, and only LSTM was evaluated. Considering SSM, the current literature presents only the research of Lee et al.¹⁰, in which only the KF approach is proposed.

In Table 11, we see that SSM (KF and TBATS) was selected in four of the seven types of cancer evaluated, while MLM (NNETAR), CSM (ARIMA) and the current method were each selected for one type of cancer.

The evaluation process adopted in this research and presented in Section Forecasting models evaluation was crucial to identify and discard biased models for each type of cancer. If we had only considered the in-sample RMSE criterion (measuring the best fitted model, on average) to select the models for each type of cancer, MLP would have been selected in all time series evaluated.

On the other hand, if we considered only out of sample RMSE criterion (measuring the best predicted values, on average), ARIMA and MLP would be selected in two types of cancer while ETS, TBATS and KF would be selected in only one type of cancer time-series (NNETAR and current method would not be selected).

The noise evaluation process adopted also allowed us to state that the current model can potentially provide a biased prediction because it failed the ACF plot evaluation for breast and cervical cancer, as we can see in Fig. 4. Therefore, we cannot classify it as statistically valid for making predictions.

It is important to note that both cancers affect the female population, and continuing to use the current method could jeopardize efficient planning of resources for their diagnosis and treatment.

Considering that, in Brazil, government policies and programs are mostly focused on these types of cancer, the situation may pose an important challenge to be overcome.

Finally, by evaluating Brazil’s current approach, CSM, SSM and MLM using four exclusion criteria (zero mean, normality, ACF and homoscedasticity tests) and one decision criterion (lowest out-of-sample RMSE), we were able to establish the best unbiased model for each type of cancer, as we wanted to illustrate. We also emphasize that by comparing different methods we can potentially improve on the main issue addressed in this research: how to provide unbiased and reliable cancer forecasting.

Although it is not the focus of this research, causal and multivariate time-series models associated with other control variables such as cigarette smoking as a predictor of lung cancer and HPV vaccination coverage for cervical cancer should be investigated. Another promising direction is to investigate age-period-cohort (APC) models and combine them with the time-series models proposed in this research.

Conclusions

This research aimed to present and apply the main time-series-based models available in forecasting literature to the seven most prevalent types of cancer in Brazil. These models fall into three classes: classical statistical models, State-Space models, and machine learning models.

As mentioned in Theoretical Background section, it is the first attempt to apply unseen methods (TBATS, NNETAR and MLP) and the three classes of models to cancer prediction.

In Brazil, the incidence of cancer is not directly measured and must be estimated based on the mortality rate. Despite the challenge of not directly measuring cancer incidence, it is crucial for public health systems to estimate the incidence of a disease that ranks second in terms of mortality rate per 100,000 inhabitants.

While acknowledging the issue of not directly measuring incidence, our research mitigates this concern by utilizing the same data and employing the same cancer incidence estimation methods. This consistency ensures that our comparison between Brazil’s current prediction method and our proposed methods remains valid.

We also contributed to filling a literature gap identified in Table 1 by applying the TBATS, MLP and NNETAR forecasting techniques to predict seven cancer types in a Brazilian district.

Furthermore, we did not find any similar studies that compared the results of three classes of univariate time-series forecasting models or addressed more than one type of cancer.

When comparing only the error results (RMSE in sample and out of sample) between the approaches mentioned above and the current technique, we demonstrated that the current method underperforms for all types of cancer tested.

Moreover, in the Discussion section, we illustrated that, for breast and cervical cancers, the current approach applied in Brazil produced biased residuals, potentially affecting the quality and reliability of cancer incidence predictions in this country. Consequently, it may provide inaccurate information to healthcare decision-makers.

Therefore, we suggest that the methods evaluated in this study should be integrated into Brazil’s cancer forecast methodology to provide a reliable prediction for healthcare decision-makers.

For further research, we also suggest a comparison between MLM time-series approaches: NNETAR and MLP (covered in this research) with LSTM, which has also been used in recent works such as Soltani et al.¹⁴ and Alrobai and Jilani¹⁵, presented in Table 1.

Although it was not the focus of this research, it should be noted that age-period-cohort (APC), previously mentioned in Section Theoretical Background, and Ensemble APC analysis as well as considering the birth-cohort effects^39,40 have potential to provide more accurate forecasts compared to traditional time-series methods that only consider period components.

Finally, by contributing with a proposal for the application of a set of tested forecasting methods to estimate the incidence of cancer in Brazil, it is intended that the results encourage a discussion on the adoption of anticipatory actions, aimed at prevention and the provision of means and resources for the early detection of the most prevalent types of cancer.

In this sense, to provide more robust predictions, causal models could also be taken into account, as in^{41,42,43,44,45,46,47}, where they are applied to other diseases. Using them, it is possible to evaluate the impact of smoking reduction or HPV vaccination strategies for lung and cervical cancer respectively, for instance.

Data availability

All relevant data are within the manuscript and its Supporting Information files.

Code availability

The code is available in the Supplementary Material (Forecasting code.R).

References

Montgomery, D. C., Jennings, C. L. & Kulahci, M. Introduction to Time Series Analysis and Forecasting (Wiley, 2015).
Yule, G. U. VII. On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. Philos. Trans. R. Soc. Lond. Series A Contain. Pap. Math. Phys. Character 226(636–646), 267–298 (1927).
Holt, C. Forecasting seasonals and trends by exponentially weighted averages (ONR memorandum no. 52). Vol. 10 (Carnegie Institute of Technology, 1957).
Brown, R. G. Statistical Forecasting for Inventory Control (McGraw/Hill, 1959).
Winters, P. R. Forecasting sales by exponentially weighted moving averages. Manage. Sci. 6(3), 324–342 (1960).
Box, G. & Jenkins, G. Time Series Analysis: Forecasting and Control (Holden-Day, 1970).
Samuel, A. L. Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3(3), 210–229 (1959).
Article MathSciNet Google Scholar
Kalman, R. E. et al. Contributions to the theory of optimal control. Bol. Soc. Mat. Mexicana 5(2), 102–119 (1960).
MathSciNet Google Scholar
Hyndman, R. J. & Athanasopoulos, G. Forecasting: Principles and Practice (OTexts, 2018).
Google Scholar
Lee, T. C., Dean, C. & Semenciw, R. Short-term cancer mortality projections: A comparative study of prediction methods. Stat. Med. 30(29), 3387–3402 (2011).
Article MathSciNet PubMed Google Scholar
Trächsel, B., Rousson, V., Bulliard, J.-L. & Locatelli, I. Comparison of statistical models to predict age-standardized cancer incidence in Switzerland. Biom. J. 65, 2200046 (2023).
Chand, R., Rao, D. K., Tekabu, T. & Khan, M. G. Modeling breast cancer cases in Fiji. In 2018 5th Asia-Pacific World Congress on Computer Science and Engineering (APWC on CSE) 283–290 (IEEE, 2018).
Lin, H., Shi, L., Zhang, J., Zhang, J. & Zhang, C. Epidemiological characteristics and forecasting incidence for patients with breast cancer in Shantou, Southern China: 2006–2017. Cancer Med. 10(8), 2904–2913 (2021).
Soltani, M., Farahmand, M. & Pourghaderi, A. R. Machine learning-based demand forecasting in cancer palliative care home hospitalization. J. Biomed. Inform. 130, 104075 (2022).
Alrobai, A. & Jilani, M. Cancer incidence prediction using a hybrid model of wavelet transform and LSTM networks. In Advances in Data Science, Cyber Security and IT Applications: First International Conference on Computing, ICC 2019, Riyadh, Saudi Arabia, December 10–12, 2019, Proceedings, Part I 1 224–235 (Springer, 2019).
Lazam, N. M., Shair, S. N., Asmuni, N. H., Jamaludin, A. & Yusri, A. A. Forecasting the incidence rates of top three cancers in Malaysia. In AIP Conference Proceedings, vol. 2500, 020052 (AIP Publishing LLC, 2023).
Tudor, C. A novel approach to modeling and forecasting cancer incidence and mortality rates through web queries and automated forecasting algorithms: Evidence from Romania. Biology 11(6), 857 (2022).
Yasmeen, F. & Zaheer, S. Functional time series models to estimate future age-specific breast cancer incidence rates for women in Karachi, Pakistan. J. Health Sci. 2(5), 213–21 (2014).
Xie, L. Time series analysis and prediction on cancer incidence rates. J. Med. Discov. 2(3), 1–10 (2017).
Dalabanjan, M. S. & Agrawal, P. Forecasting age adjusted rates of lung cancer in Mumbai by fitting ARIMA models. In ICDSMLA 2020: Proceedings of the 2nd International Conference on Data Science, Machine Learning and Applications, 1181–1194 (Springer, 2022).
Instituto Nacional de Câncer José Alencar Gomes da Silva/ Ministério da Saúde: Atlas On-line de Mortalidade. Accessed 7 July 2023 https://www.inca.gov.br/MortalidadeWeb/pages/Modelo10/consultar.xhtml;jsessionid=289C9A6D91A1BFCEA8FDD2CDAE2A81A7 (2023)
Instituto Brasileiro de Geografia e Estatística - IBGE: Population Census, Brazil. Accessed 7 July 2023 https://www.ibge.gov.br/en/statistics/social/labor/22836-2022-census-3.html (2023)
Black, R., Bray, F., Ferlay, J. & Parkin, D. Cancer incidence and mortality in the European union: Cancer registry data and estimates of national incidence for 1990. Eur. J. Cancer 33(7), 1075–1107 (1997).
Ferlay, J. et al. Cancer incidence and mortality patterns in Europe: Estimates for 40 countries in 2012. Eur. J. Cancer 49(6), 1374–1403 (2013).
Ferlay, J. et al. Estimating the global cancer incidence and mortality in 2018: GLOBOCAN sources and methods. Int. J. Cancer 144(8), 1941–1953 (2019).
Hyndman, R. J. & Khandakar, Y. Automatic time series forecasting: The forecast package for R. J. Stat. Softw. 27, 1–22 (2008).
Petris, G. An R package for dynamic linear models. J. Stat. Softw. 36, 1–16 (2010).
Kourentzes, N. Nnfor: Time Series Forecasting with Neural Networks (2022). R package version 0.9.8. https://CRAN.R-project.org/package=nnfor
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2022). https://www.R-project.org/
Kotu, V. & Deshpande, B. Time series forecasting. Data Science 395–445 (Elsevier, 2019).
Haykin, S. Kalman Filtering and Neural Networks Vol. 47 (Wiley, 2004).
De Livera, A. M., Hyndman, R. J. & Snyder, R. D. Forecasting time series with complex seasonal patterns using exponential smoothing. J. Am. Stat. Assoc. 106(496), 1513–1527 (2011).
Kourentzes, N., Barrow, D. K. & Crone, S. F. Neural network ensemble operators for time series forecasting. Expert Syst. Appl. 41(9), 4235–4244 (2014).
Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52(3/4), 591–611 (1965).
Box, G. E. & Pierce, D. A. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Am. Stat. Assoc. 65(332), 1509–1526 (1970).
Pagano, M. Estimation of models of autoregressive signal plus white noise. Ann. Stat. 2, 99–108 (1974).
Ljung, G. M. & Box, G. E. On a measure of lack of fit in time series models. Biometrika 65(2), 297–303 (1978).
Bagchi, P., Characiejus, V. & Dette, H. A simple test for white noise in functional time series. J. Time Ser. Anal. 39(1), 54–74 (2018).
Chen, Y.-C. et al. Forecast of a future leveling of the incidence trends of female breast cancer in Taiwan: An age-period-cohort analysis. Sci. Rep. 12(1), 12481 (2022).
Hsiao, B.-Y. et al. Ensemble forecasting of a continuously decreasing trend in bladder cancer incidence in Taiwan. Sci. Rep. 11(1), 8373 (2021).
Guo, H. et al. Time series study on the effects of daily average temperature on the mortality from respiratory diseases and circulatory diseases: A case study in Mianyang city. BMC Public Health 22(1), 1001 (2022).
Lu, L. et al. Time series analysis of dengue fever and weather in Guangzhou, China. BMC Public Health 9, 1–5 (2009).
Reyes-Urueña, J. M., Olalla, P. G. D., Perez-Hoyos, S. & Caylà, J. A. Time series analysis comparing mandatory and voluntary notification of newly diagnosed HIV infections in a city with a concentrated epidemic. BMC Public Health 13(1), 1–8 (2013).
Yokoyama, S. et al. Day-to-day regularity and diurnal switching of physical activity reduce depression-related behaviors: A time-series analysis of wearable device data. BMC Public Health 23(1), 1–9 (2023).
Sowe, A., Namatovu, F., Cham, B. & Gustafsson, P. E. Impact of a performance monitoring intervention on the timeliness of hepatitis B birth dose vaccination in the Gambia: A controlled interrupted time series analysis. BMC Public Health 23(1), 1–11 (2023).
Zhu, G. et al. The association between ambient temperature and mortality of the coronavirus disease 2019 (COVID-19) in Wuhan, China: A time-series analysis. BMC Public Health 21, 1–10 (2021).
Luo, C. et al. Long-term air pollution levels modify the relationships between short-term exposure to meteorological factors, air pollution and the incidence of hand, foot and mouth disease in children: A DLNM-based multicity time series study in Sichuan province, China. BMC Public Health 22(1), 1484 (2022).


Acknowledgements

Gratitude is expressed to Laboratório de Engenharia e Gestão em Saúde (LEGOS/UERJ) and the Cancer Foundation (Fundação Ary Frauzino para Pesquisa e Controle do Câncer) for providing support during the APC process of this research. The authors also acknowledge the support of Grant PID2022-137748OB-C31 funded by MCIN/AEI/10.13039/501100011033 and “ERDF A way of making Europe”.

Funding

The author(s) received no specific funding for this work.

Author information

Authors and Affiliations

Department of Industrial Engineering, Universidade do Estado do Rio de Janeiro, São Francisco Xavier, 524, Rio de Janeiro, Rio de Janeiro, 20550-900, Brazil
Daniel Bouzon Nagem Assad, Patricia Gomes Ferreira da Costa & Thaís Spiegel
Escuela Técnica Superior de Ingenieros Industriales, Universidad Politécnica De Madrid, Jose Gutierrez Abascal, 2, 28006, Madrid, Madrid, Spain
Daniel Bouzon Nagem Assad, Javier Cara & Miguel Ortega-Mier
Fundação Ary Frauzino para Pesquisa e Controle do Câncer, Inválidos, 212, Rio de Janeiro, Rio de Janeiro, 20231-048, Brazil
Alfredo Monteiro Scaff

Authors

Daniel Bouzon Nagem Assad
Patricia Gomes Ferreira da Costa
Thaís Spiegel
Javier Cara
Miguel Ortega-Mier
Alfredo Monteiro Scaff

Contributions

D.B.N.A., J.C. and M.O.M. conceived the presented idea. D.B.N.A. and P.G.F.C. carried out the experiment and wrote the manuscript with support from T.S. and A.M.S. T.S. and A.M.S. supervised the project. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Corresponding author

Correspondence to Daniel Bouzon Nagem Assad.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Bouzon Nagem Assad, D., Gomes Ferreira da Costa, P., Spiegel, T. et al. Comparing the current short-term cancer incidence prediction models in Brazil with state-of-the-art time-series models. Sci Rep 14, 4566 (2024). https://doi.org/10.1038/s41598-024-55230-2


Received: 17 July 2023
Accepted: 21 February 2024
Published: 25 February 2024
DOI: https://doi.org/10.1038/s41598-024-55230-2
