Abstract
The World Health Organization has highlighted that cancer was the second-highest cause of death in 2019. This research aims to present the current forecasting techniques found in the literature, as applied to predicting cancer incidence time series, and then to compare their results with the current methodology adopted by the Instituto Nacional do Câncer (INCA) in Brazil. A set of univariate time-series approaches is proposed to aid decision-makers in monitoring and organizing cancer prevention and control actions. Additionally, this can guide oncological research towards more accurate estimates that align with the expected demand. Forecasting techniques were applied to real data from seven types of cancer in a Brazilian district. Each method was evaluated by comparing its fit with real data using the root mean square error, and we also assessed the quality of noise to identify biased models. Notably, three methods proposed in this research have never been applied to cancer prediction before. The data were collected from the INCA website, and the forecast methods were implemented using the R language. Through a literature review, it was possible to draw comparisons with previous works worldwide and to illustrate that cancer prediction is often focused on breast and lung cancers, typically utilizing a limited number of time-series models to find the best fit for each case. Additionally, in comparison to the current method applied in Brazil, it has been shown that employing more generalized forecast techniques can provide more reliable predictions. By evaluating the noise in the current method, this research showed that the existing prediction model is biased for two of the studied cancers. Comparing error results between the mentioned approaches and the current technique, it has been shown that the current method applied by INCA underperforms in six out of seven types of cancer tested.
Moreover, this research identified that the current method can produce a biased prediction for two of the seven cancers evaluated. Therefore, it is suggested that the methods evaluated in this work should be integrated into the INCA cancer forecast methodology to provide reliable predictions for Brazilian healthcare professionals, decision-makers, and oncological researchers.
Theoretical background
A time series is a sequence of time-oriented observations related to forecasting or controlling a specific variable1. This thematic study originated in 1927, adopting a general approach to time series analysis2. Nearly three decades later, new time series forecasting approaches began to emerge.
Initially, classical time series statistical models were proposed3. Subsequently, these models were refined to include exponential smoothing techniques4,5 before evolving into auto-regressive moving average models6. Eventually, they progressed further to incorporate Machine Learning7 and State-Space models8.
In all instances, the predictability of future events is a central element, crucial for planning and processes related to Operations Management, among others, such as Marketing, Economics, and Demography1. However, the predictability of an event or quantity depends on various factors, including an understanding of the influencing factors, data availability, future and past similarities, and the potential impact of forecasts on the predicted outcome9.
In the context of oncology studies, mortality and incidence projection methods have already been compared in Canada, using age-period-cohort (APC), auto-regressive time series, and state-space models for at least ten cancer types10.
APC and Bayesian APC, auto-regressive integrated moving average (ARIMA) time series, and simple linear models were also compared for five cancer types in Switzerland11.
Using reported breast cancer cases in the Fijian population from 1995 to 2016, Chand et al.12 attempted to apply an ARIMA model to provide a 12-month ahead prediction. However, faced with non-stationary data according to the Augmented Dickey-Fuller test, a linear regression model was chosen. The proposed model was compared with the Naive Forecast Method, showing that the linear regression model outperformed the Naive Forecast Method.
Also exploring the epidemiological characteristics of breast cancer, Lin et al.13 used Exponential Smoothing (ETS) and Autoregressive Integrated Moving Average (ARIMA) models to forecast breast cancer incidence in China.
Regarding palliative cancer care, two different long short-term memory (LSTM) models were proposed, aiming to forecast the patients’ next visit day and estimate the total patient demand 1 week ahead14. For this purpose, the patients’ requirements, demographics, and service history profiles were taken into account.
Alrobai and Jilani15 also applied LSTM to forecast the incidence of the three most prevalent cancers in Saudi Arabia. However, it’s crucial to note that cancer prevalence can significantly vary from one country to another.
In Malaysia, to deal with the continued annual growth in cancer incidence rates, particularly female breast, colorectal, and lung cancer, Lazam et al.16 tested ARIMA and Exponential Smoothing (ETS) models. They intended to determine the best rates for incidence prediction for these mentioned types of cancer.
Tudor17 proposed alternative ways to forecast cancer incidence and mortality by connecting population web-search practices with health variables officially published by Romanian authorities. The applied models included ARIMA, the Exponential Smoothing State-Space Model with Box-Cox Transformation, ARMA Errors, Trend, and Seasonal Components, and a feed-forward neural network nonlinear autoregression model.
In this research, conducted in Brazil, we present a framework to evaluate previous works on cancer time-series prediction, dividing the time-series models, according to Hyndman and Athanasopoulos9, into Classical Statistical models (CSM), State-Space models (SSM), and Machine Learning models (MLM) (Table 1). For this, only studies that make cancer predictions were considered.
After comparing ten previous works related to cancer incidence prediction (Table 1), we can conclude that:
1. Breast and lung cancer incidence predictions have garnered more attention in specialized literature and have been studied in 8 and 7 works, respectively; colorectal cancer has been studied in 5 works, while other cancer types have been studied in 4 works or fewer.
2. CSM, and particularly ARIMA, were the most used approaches.
3. Considering SSM and MLM, TBATS, NNETAR, and MLP were never covered in previous research.
4. We found no previous work in which all three classes of models were applied.
As will be presented in this paper, the third and fourth conclusions allow us to state that this work covers a gap in current cancer prediction. Thus, applying unseen methods (3rd) and the three classes of models (4th) to cancer prediction is an original contribution of this research.
Finally, the mentioned studies address the application of different forecasting methods in countries such as Canada, Switzerland, Fiji, China, Malaysia, and Romania. Their use in Brazil, for a larger sample of types of cancer and comparing them, seems like a complementary contribution.
Methods
Data collection
In this research, we analyze real cancer data from Brazil obtained from INCA. All time series used are presented in Fig. 1 and are also available in Table 2. The seven cancer types evaluated are: Breast cancer (ICD-10 C50), Colorectal Cancer (ICD-10 C18 to C21), Prostate cancer (ICD-10 C61), Lung cancer (ICD-10 C33 and C34), Cervical cancer (ICD-10 C53), Head and Neck Cancer (ICD-10 C00 to C10) and Childhood Cancer (ICD-10 C00 to C96).
The filters employed for each type of cancer can be found in Table 3. We gathered the Brazilian cancer mortality rates from INCA’s website21. The population figures were obtained from the 2022 Brazilian census22.
In Brazil, cancer incidence is not registered for all districts. Therefore, INCA works with an approximate incidence inferred from the mortality rate, following the estimation methodologies of Black et al.23, Ferlay et al.24 and Ferlay et al.25, which are based mainly on the I/M ratio.
The mentioned methodologies link the unknown adjusted incidence rate (IRa) of a district to its known adjusted mortality rate (MRa) by the equation \(IRa = MRa*(I_R/M_0)\), where \(I_R\) refers to the known incidence in districts geographically near the targeted unmeasured district and \(M_0\) refers to the number of deaths in the same districts. The \(I_R/M_0\) ratios for the unmeasured district evaluated are presented in Table 4.
In Fig. 1 we present, for each type of cancer studied, the mortality rate adjusted by the world population per 100,000 inhabitants. Then, according to the estimation methodologies23,24,25, the expected adjusted incidence rate IRa can be obtained by multiplying the adjusted mortality rate in Fig. 1 by the value presented in Table 4 for each cancer type.
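The I/M relation above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code; the function name and the input values are hypothetical and are not taken from Table 2 or Table 4.

```python
def estimate_incidence_rate(mortality_rate_adj, reference_incidence, reference_deaths):
    """Return IRa = MRa * (I_R / M_0) for one cancer type and one year."""
    if reference_deaths <= 0:
        raise ValueError("reference_deaths (M_0) must be positive")
    return mortality_rate_adj * (reference_incidence / reference_deaths)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
example_ira = estimate_incidence_rate(
    mortality_rate_adj=10.0,  # MRa, deaths per 100,000 inhabitants
    reference_incidence=300,  # I_R, new cases in the neighbouring districts
    reference_deaths=100,     # M_0, deaths in the same neighbouring districts
)
print(example_ira)  # 10.0 * (300 / 100)
```

Because the ratio is a single constant per cancer type, the estimated incidence series is a rescaled copy of the mortality series, so its shape over time is the same.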
The current short-term predictions in Brazil rely on the average of the 3 most recent years. This outcome serves as a reference for the Brazilian public health system over the next 3 years. In essence, the existing approach is a simple moving average (MA).
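As a minimal sketch of this baseline, assuming the same average is simply repeated for each of the next 3 years, the forecast can be written as follows. The function name and the sample values are hypothetical; this is a Python illustration, not INCA's implementation.

```python
def moving_average_forecast(series, window=3, horizon=3):
    """Repeat the mean of the last `window` observations `horizon` times."""
    if len(series) < window:
        raise ValueError("series is shorter than the averaging window")
    mean_recent = sum(series[-window:]) / window
    return [mean_recent] * horizon


# Hypothetical yearly incidence rates (per 100,000 inhabitants).
history = [10.0, 20.0, 30.0, 40.0]
forecast = moving_average_forecast(history)
print(forecast)  # mean of 20, 30 and 40, repeated for 3 years
```

The forecast is flat by construction, so any trend in the most recent years is not carried forward.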
Forecasting models applied
In this research we apply the univariate forecasting methods available in Hyndman and Khandakar26, Petris27 and Kourentzes28. The models applied in the next sections are presented in Table 5. These models were implemented in the R29 language (version 4.1.3) and the code used is available in the Supplementary Material (Forecasting code.R).
Building each model requires estimating many parameters, but the main features of each model are presented below:
- ETS: ETS is a class of models that essentially works with three component equations, level (\(l_t\)), trend (\(b_t\)) and season (\(s_t\)), to explain the original time series variable (\(y_t\)) that we aim to forecast. In each model, a component may be non-significant, denoted None (N), or significant, in which case it describes \(y_t\) better as an Additive (A), Additive Damped (Ad) or Multiplicative (M) feature. The components of this class of models can be combined in 18 different ways (Fig. 2). For more details see Hyndman and Athanasopoulos9.
- ARIMA: ARIMA or Seasonal ARIMA (SARIMA) is a class of models that combines autoregressive (AR) and moving average (MA) terms with differenced values. The AR part of ARIMA (p) indicates that the time series is regressed on its own past data. The MA part of ARIMA (q) indicates that the forecast error is a linear combination of past errors. The I part of ARIMA (d) refers to differencing of order d, applied to obtain a stationary time series to which the ARMA modeling approach can be applied30. SARIMA differs from ARIMA in that the same components also appear lagged by the length of the seasonal time window (frequency), as P, D and Q. For more details see Hyndman and Athanasopoulos9 and Kotu and Deshpande30.
- Kalman filter (KF): KF methods search for the smallest vector that summarizes the past of the system and best describes the state of a deterministic dynamic system31. The KF is basically composed of a linear autoregressive state equation \({x(t)} = A*x(t-1) + W(t)\), where \(W(t) \sim N(0,Q)\), and a measurement equation \({y(t)} = C*x(t) + V(t)\), where \(V(t) \sim N(0,R)\), which together define the linearized process in which \(y(t) \in {\mathbb {R}}\). The random variables W(t) and V(t) are assumed to be independent of each other and both must follow a normal distribution.
- TBATS: The TBATS model is the Trigonometric Seasonal (T) Exponential Smoothing Method + Box-Cox Transformation + ARMA model for residuals (BATS). The equations of the TBATS model are presented below, where \(\omega\) and \(\phi\) are the Box-Cox and damping parameters respectively, \(b\) is the long-run trend, an ARMA(p, q) process models the error \(d_t\), \(m_1\) to \(m_J\) list the seasonal periods used, \(k_1\) to \(k_J\) are the corresponding numbers of Fourier terms used and \(\lambda _j^{(i)} = 2\pi j/m_i\). For more details see De Livera et al.32.
$$\begin{aligned} y_{t}^{(\omega )}&= \frac{y_{t}^{\omega }-1}{\omega }, \quad \omega \ne 0,\\ y_{t}^{(\omega )}&= \log {y_{t}}, \quad \omega = 0,\\ y_{t}^{(\omega )}&= l_{t-1}+\phi b_{t-1}+\sum _{i=1}^{J} s_{t-m_i}^{(i)} +d_t,\\ l_{t}&= l_{t-1}+\phi b_{t-1} +\alpha d_t,\\ b_{t}&= (1-\phi ) b +\phi b_{t-1}+\beta d_t,\\ s_{t}^{(i)}&= s_{t-m_i}^{(i)} +\gamma _i d_t,\\ d_{t}&= \sum _{i=1}^{p} \phi _i d_{t-i}+\sum _{i=1}^{q} \theta _i \epsilon _{t-i} +\epsilon _{t},\\ s_{t}^{(i)}&= \sum _{j=1}^{k_i} s_{j,t}^{(i)},\\ s_{j,t}^{(i)}&= s_{j,t-1}^{(i)}\cos {\lambda _j^{(i)}} + s_{j,t-1}^{*(i)}\sin {\lambda _j^{(i)}} + \gamma _1^{(i)} d_t,\\ s_{j,t}^{*(i)}&= -s_{j,t-1}^{(i)}\sin {\lambda _j^{(i)}} + s_{j,t-1}^{*(i)}\cos {\lambda _j^{(i)}} + \gamma _2^{(i)} d_t \end{aligned}$$
- NNETAR: Neural Network Time Series Forecasts (NNETAR) is a class of feed-forward neural networks with a single hidden layer and lagged inputs. This model works with 2 (for non-seasonal time series) or 3 (for seasonal time series) parameters: the number of past observations used as inputs (p), the number of past observations lagged by the length of the seasonal time window used as inputs (P) and the number of neurons (k) in the single hidden layer. In this research, a total of 20 networks are fitted, each with random starting weights. These are then averaged when computing forecasts. The network is trained for one-step forecasting. Multi-step forecasts are computed recursively. The k selected for each type of cancer is half of the number of input nodes plus 1. For non-seasonal data, the fitted model is denoted as an NNAR (p, k) (Neural Network Autoregressive) model, which is analogous to an AR (p) model but with nonlinear functions. For seasonal data, the fitted model is called an NNAR (p, P, k)[m] model, which is analogous to an ARIMA (p, 0, 0)(P, 0, 0)[m] model but with nonlinear functions. For more details see Hyndman and Athanasopoulos9.
- MLP: MLP is an extension of the feed-forward neural network in which an arbitrary number of hidden layers are placed between the input and output layers (the true computational engine of the MLP). According to Kourentzes et al.33, MLPs are designed to approximate any continuous function and can solve problems which are not linearly separable. In our time-series problem, the input layer (like p in the NNETAR model) consists of the most recent past observations, and we set the MLP model to choose the best number of input lags, between 1 and the prediction length (3 years), according to the Mean Square Error. The same criterion was adopted to choose the number of hidden nodes in each hidden layer. For more details see Kourentzes et al.33.
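To make the state-space idea behind the Kalman filter more concrete, the sketch below filters a scalar local-level model, that is, the special case A = C = 1 of the linear state and measurement equations. This special case, the function name, the noise variances and the initial values are all assumptions made for illustration; the sketch is written in Python and does not reproduce the model specification fitted in the Supplementary Material.

```python
def local_level_filter(observations, q, r, x0, p0):
    """Scalar Kalman filter for x(t) = x(t-1) + W(t), y(t) = x(t) + V(t).

    q and r are the variances of W(t) and V(t); x0 and p0 are the
    initial state estimate and its variance (all assumed values).
    """
    x, p = x0, p0
    filtered = []
    for y in observations:
        # Prediction step: the level is carried forward, uncertainty grows by q.
        x_pred = x
        p_pred = p + q
        # Update step: blend prediction and observation using the Kalman gain.
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (y - x_pred)
        p = (1.0 - gain) * p_pred
        filtered.append(x)
    return filtered


# Hypothetical constant series: the filtered level should settle near 5.
states = local_level_filter([5.0] * 10, q=1.0, r=1.0, x0=0.0, p0=1e6)
print(states[-1])
```

In this special case the h-step-ahead forecast is the last filtered level, since the state equation has no drift.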
Forecasting models evaluation
The data presented in Table 2 were multiplied by the I/M ratio of each cancer type shown in Table 4 to estimate the incidence rate of each type of cancer evaluated (Fig. 3).
For instance, for Breast cancer, the ICD-10 mortality rate per 100,000 inhabitants is 17.77 in 1979, 21.73 in 1980 and so on (second column of Table 2). Thus, the adjusted Breast cancer incidence rate is given by these values multiplied by 5.59 (the Breast cancer \(I_R/M_0\) ratio in Table 4), which yields 77.65 in 1979, 94.96 in 1980, 69 in 1981 and so on, as can be seen in Fig. 3.
In this research, we are interested in providing a comparison between Brazil’s current short-term cancer prediction and state-of-the-art time-series models. As mentioned in Section Data collection, since the current short-term cancer predictions are made 3 years ahead, we split our dataset into training data (from 1979 to 2017) and test data (from 2018 to 2020).
Training (in sample) and test (out of sample) data are evaluated using the Root Mean Square Error (RMSE) criterion. A low RMSE in sample value indicates a good average fit of the model used while a low value of RMSE out of sample indicates that the model used, on average, delivers a reliable forecast9.
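The split and the error measure can be sketched as below. This is a minimal Python illustration with a hypothetical series; the helper names are assumptions and the values are not the data of Table 2.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def split_by_year(series_by_year, last_train_year=2017):
    """Split a {year: value} mapping into training and test mappings."""
    train = {y: v for y, v in series_by_year.items() if y <= last_train_year}
    test = {y: v for y, v in series_by_year.items() if y > last_train_year}
    return train, test


# Hypothetical yearly series covering 1979 to 2020.
series = {year: float(year - 1979) for year in range(1979, 2021)}
train, test = split_by_year(series)
print(len(train), len(test))

# Out-of-sample RMSE of a hypothetical 3-year forecast.
out_of_sample_rmse = rmse(list(test.values()), [39.0, 40.0, 43.0])
print(out_of_sample_rmse)
```

The in-sample RMSE is computed the same way, using the fitted values over the training years instead of the forecasts.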
Below we present the criteria adopted to evaluate the predictions of the current and proposed methods for each cancer type:
- The noise evaluation over the training (in sample) data according to the following tests: Student's t-test (ST), normality test (NT), auto-correlation function (ACF) plot and Breusch-Pagan test (BPT);
- The error evaluation according to the test (out of sample) Root Mean Square Error (RMSE).
If the residuals have a mean error of 0 in the Student test, follow a normal distribution in the Shapiro–Wilk test, remain within the interval defined by the blue lines in the ACF plot for all lags and present constant variance over time (homoscedasticity) in the Breusch-Pagan test, we consider that the model residuals are white noise, which means that the model is unbiased34,35,36,37,38.
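The ACF part of this check can be sketched as follows, assuming the plotted bounds are the usual approximate 95% limits of ±1.96/√n for white noise. The function names and residual values are hypothetical; this is a Python illustration rather than the R routine used in this research.

```python
import math


def autocorrelation(residuals, max_lag):
    """Sample autocorrelations of the residuals for lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("residuals have zero variance")
    return [
        sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom
        for lag in range(1, max_lag + 1)
    ]


def acf_within_bounds(residuals, max_lag):
    """True if every autocorrelation lies inside +/- 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(len(residuals))
    return all(abs(r) <= bound for r in autocorrelation(residuals, max_lag))


# Hypothetical residuals that alternate in sign: strong lag-1 correlation.
alternating = [1.0, -1.0] * 10
print(autocorrelation(alternating, 1))
print(acf_within_bounds(alternating, 1))
```

A model whose residuals fail this check at any lag still carries structure that it did not capture, which is why it is discarded before the RMSE comparison.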
The significance level adopted in this research is 0.05 which means that residuals produced a white noise if the obtained p-values in each test are higher than 0.05 to each model.
Thus, in this research we consider that the best model for each cancer type is the one whose residuals (1) fulfill all the requirements previously presented and that (2) obtains the lowest out of sample RMSE.
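The two-step rule above (exclude biased models, then pick the lowest out of sample RMSE) can be sketched as below. The data structure, model names, p-values and RMSE values are hypothetical and serve only to illustrate the selection logic in Python.

```python
def select_best_model(results, alpha=0.05):
    """Return the unbiased model with the lowest out-of-sample RMSE.

    `results` maps a model name to a dict with:
      - "p_values": p-values of the Student, normality and Breusch-Pagan tests
      - "acf_ok": True if all autocorrelations stay inside the ACF bounds
      - "rmse_out": out-of-sample RMSE
    Returns None when every model fails the white-noise checks.
    """
    unbiased = {
        name: r
        for name, r in results.items()
        if r["acf_ok"] and all(p > alpha for p in r["p_values"].values())
    }
    if not unbiased:
        return None
    return min(unbiased, key=lambda name: unbiased[name]["rmse_out"])


# Hypothetical diagnostics for three candidate models.
results = {
    "model_a": {"p_values": {"student": 0.40, "normality": 0.01, "bp": 0.30},
                "acf_ok": True, "rmse_out": 1.0},
    "model_b": {"p_values": {"student": 0.50, "normality": 0.20, "bp": 0.60},
                "acf_ok": True, "rmse_out": 2.0},
    "model_c": {"p_values": {"student": 0.70, "normality": 0.30, "bp": 0.40},
                "acf_ok": False, "rmse_out": 0.5},
}
best = select_best_model(results)
print(best)
```

Here the two models with the lowest RMSE are discarded (one fails the normality test, the other fails the ACF check), so a model with a larger error but white-noise residuals is selected.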
Results
In this section we apply the methods presented in columns of Table 5 to each type of cancer incidence presented in Fig. 3. In Table 6 we summarize the in sample and out of sample RMSE results by model and type of cancer.
As mentioned in Section Forecasting models evaluation, to compare the model errors summarized in Table 6 we select the out of sample RMSE criterion. Then, to ensure that the model residuals are white noise in the training data, we apply the Student test (Table 7), the ACF plot, the Shapiro–Wilk normality test (Table 8) and the Breusch-Pagan test (Table 9).
That is, besides considering the RMSE criterion (Table 6), we must also evaluate whether each model produced white-noise residuals, taking into account their auto-correlation plots and normality tests for all cancer types.
This evaluation is presented for all types of cancer evaluated, grouped (Figure 4) and individually—breast (Figure 5), colorectal (Figure 6), prostate (Figure 7), lung (Figure 8), cervical (Figure 9), head and neck (Figure 10) and childhood (Figure 11).
The white noise failure evaluation by model and by cancer type is summarized in Table 10.
Considering the criteria presented in Section Forecasting models evaluation to ensure an unbiased model, we must select the best model for each type of cancer evaluated after discarding the following failed (biased) models:
- Breast cancer: the current model, ETS, ARIMA, TBATS and KF failed the auto-correlation function (ACF) plot check (Fig. 5), and MLP failed the normality test.
- Colorectal cancer: MLP failed the ACF plot check (Fig. 6), the Breusch-Pagan test and the normality test; NNETAR also failed the Breusch-Pagan test.
- Prostate cancer: NNETAR failed the ACF plot check (Fig. 7), and MLP failed the normality test.
- Lung cancer: KF failed the Student test and the ACF plot check (Fig. 8), and MLP failed the normality test and the ACF plot check.
- Cervical cancer: only the current model produced residuals with significant auto-correlation in the ACF plot (Fig. 9), and MLP failed the normality test.
- Head and neck cancer: ARIMA failed the ACF plot check (Fig. 10), and KF and MLP failed the normality test.
- Childhood cancer: MLP failed the normality test.
Thus, the best model for each cancer type is: NNETAR for breast, KF for colorectal, ARIMA for prostate, TBATS for lung, KF for cervical, the current method for head and neck and KF for childhood.
Their prediction plots can be seen respectively in Figs. 12, 13, 14, 15, 16, 17 and 18. The 3-year ahead prediction values are summarized in Table 11.
Discussion
A limitation of this research could be observed in the method used to obtain the incidence of cancer in Brazil. This occurs because, in practice, the incidence is not measured. Thus, we used cancer incidence estimation methodologies proposed in Black et al.23, Ferlay et al.24 and Ferlay et al.25 which are based on the mortality rate discussed in Section Data collection.
Assuming that the presented estimation methodologies give us the best cancer incidence estimate, and evaluating only univariate time-series models, our findings in Table 6 seem to indicate that the current model applied by INCA in Brazil to forecast cancer incidence underperforms in 6 of the 7 types of cancer considered in this research. So, the presented methods seem to behave more adequately than Brazil’s current methodology.
It is important to note that we are working with the same type and amount of data that is used today, meaning that it would not be necessary to collect new variables in order to increase the accuracy of the forecast.
In addition, we did not see the CSM models outperform the others in any type of cancer, although ARIMA models (CSM) are the most widely used models in the current literature so far as we presented in Table 1.
These facts imply that, while there are no broad and reliable Population-Based Cancer Registries in the country, all research that uses these data as a primary source will be limited, including this one.
However, it is necessary to consider that Brazil has continental dimensions and a technological lag that do not facilitate the implementation of this type of registry. Although restrictive, this fact has not prevented research and public policies aimed at cancer prevention and control in the country, which surely could be more effective.
In this sense, we reinforce that the aim is not to invalidate what has been done in the country, but to plead for the opening of space so that new, more accurate forecast models can be adopted, in support of strategic decisions to face cancer in the country, especially since the current literature (Table 1) has used models that go in the opposite direction of the results presented by this research.
For instance, MLM models were only used in the works of Soltani et al.14 and Alrobai and Jilani15, and only LSTM was evaluated. Considering SSM, the current literature presents only the research of Lee et al.10, in which only the KF approach is proposed.
In Table 11, we see that SSM (KF and TBATS) was selected in four of the seven types of cancer evaluated, while MLM (NNETAR), CSM (ARIMA) and the current method were each selected for one type of cancer.
The evaluation process adopted in this research and presented in Section Forecasting models evaluation was crucial to identify and discard biased models to each type of cancer. If we had only considered in sample RMSE criterion (measuring the best fitted model, on average) to select the models to each type of cancer, MLP would be selected in all time-series evaluated.
On the other hand, if we considered only out of sample RMSE criterion (measuring the best predicted values, on average), ARIMA and MLP would be selected in two types of cancer while ETS, TBATS and KF would be selected in only one type of cancer time-series (NNETAR and current method would not be selected).
The noise evaluation process adopted also allowed us to state that the current model can potentially provide a biased prediction because it failed the ACF plot check for Breast and Cervical cancer, as we can see in Fig. 4. Therefore, we cannot classify it as statistically valid for making predictions.
It is important to note that both cancers affect the female population, and continuing to use the current method could jeopardize efficient planning of resources for their diagnosis and treatment.
Considering that, in Brazil, government policies and programs are mostly focused on these types of cancer, the situation may pose an important challenge to be overcome.
Finally, by evaluating Brazil’s current approach, CSM, SSM and MLM using four exclusion criteria (mean 0, normality, ACF and homoscedasticity tests) and one decision criterion (lowest out of sample RMSE), we were able to establish the best unbiased model for each type of cancer, as we wanted to illustrate. We also emphasize that by comparing different methods we can potentially improve the main issue addressed in this research: how to provide unbiased and reliable cancer forecasting.
Although it is not the focus of this research, causal and multivariate time-series models associated with other control variables such as cigarette smoking as a predictor of lung cancer and HPV vaccination coverage for cervical cancer should be investigated. Another promising direction is to investigate age-period-cohort (APC) models and combine them with the time-series models proposed in this research.
Conclusions
This research aimed to present and apply the main time-series-based models available in forecasting literature to the seven most prevalent types of cancer in Brazil. These models fall into three classes: classical statistical models, State-Space models, and machine learning models.
As mentioned in Theoretical Background section, it is the first attempt to apply unseen methods (TBATS, NNETAR and MLP) and the three classes of models to cancer prediction.
In Brazil, the incidence of cancer is not directly measured and must be estimated based on the mortality rate. Despite the challenge of not directly measuring cancer incidence, it is crucial for public health systems to estimate the incidence of a disease that ranks second in terms of mortality rate per 100,000 inhabitants.
While acknowledging the issue of not directly measuring incidence, our research mitigates this concern by utilizing the same data and employing the same cancer incidence estimation methods. This consistency ensures that our comparison between Brazil’s current prediction method and our proposed methods remains valid.
We also helped fill a literature gap identified in Table 1 by applying the TBATS, MLP and NNETAR forecasting techniques to predict seven cancer types in a Brazilian district.
Furthermore, we did not find any similar studies that compared the results of three classes of univariate time-series forecasting models or addressed more than one type of cancer.
When comparing only the error results (RMSE in sample and out of sample) between the approaches mentioned above and the current technique, we demonstrated that the current method underperforms for all types of cancer tested.
Moreover, in the Discussion section, we illustrated that, for breast and cervical cancers, the current approach applied in Brazil produced biased residuals, potentially affecting the quality and reliability of cancer incidence predictions in this country. Consequently, it may provide inaccurate information to healthcare decision-makers.
Therefore, we suggest that the methods evaluated in this study should be integrated into Brazil’s cancer forecast methodology to provide a reliable prediction for healthcare decision-makers.
For further research, we also suggest a comparison between MLM time-series approaches: NNETAR and MLP (covered in this research) with LSTM, which has also been used in recent previous works such as Soltani et al.14 and Alrobai and Jilani15, presented in Table 1.
Although it was not the focus of this research, it should be noted that age-period-cohort (APC), previously mentioned in Section Theoretical Background, and Ensemble APC analysis as well as considering the birth-cohort effects39,40 have potential to provide more accurate forecasts compared to traditional time-series methods that only consider period components.
Finally, by contributing with a proposal for the application of a set of tested forecasting methods to estimate the incidence of cancer in Brazil, it is intended that the results encourage a discussion on the adoption of anticipatory actions, aimed at prevention and the provision of means and resources for the early detection of the most prevalent types of cancer.
In this sense, to provide more robust predictions, causal models could also be taken into account, as applied to other diseases in41,42,43,44,45,46,47. Using them, it is possible to evaluate, for instance, the impact of smoking reduction or HPV vaccination strategies on lung and cervical cancer, respectively.
Data availability
All relevant data are within the manuscript and its Supporting Information files.
Code availability
The code is available in the Supplementary Material (Forecasting code.R).
References
Montgomery, D. C., Jennings, C. L. & Kulahci, M. Introduction to Time Series Analysis and Forecasting (Wiley, 2015).
Yule, G. U. VII. On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers. Philos. Trans. R. Soc. Lond. Series A Contain. Pap. Math. Phys. Character 226(636–646), 267–298 (1927).
Holt, C. Forecasting seasonals and trends by exponentially weighted averages (ONR memorandum no. 52). Vol. 10 (Carnegie Institute of Technology, 1957).
Brown, R. G. Statistical Forecasting for Inventory Control (McGraw-Hill, 1959).
Winters, P. R. Forecasting sales by exponentially weighted moving averages. Manage. Sci. 6(3), 324–342 (1960).
Box, G. & Jenkins, G. Time Series Analysis: Forecasting and Control (Holden-Day, 1970).
Samuel, A. L. Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3(3), 210–229 (1959).
Kalman, R. E. et al. Contributions to the theory of optimal control. Bol. Soc. Mat. Mexicana 5(2), 102–119 (1960).
Hyndman, R. J. & Athanasopoulos, G. Forecasting: Principles and Practice (OTexts, 2018).
Lee, T. C., Dean, C. & Semenciw, R. Short-term cancer mortality projections: A comparative study of prediction methods. Stat. Med. 30(29), 3387–3402 (2011).
Trächsel, B., Rousson, V., Bulliard, J.-L. & Locatelli, I. Comparison of statistical models to predict age-standardized cancer incidence in Switzerland. Biom. J. 65, 2200046 (2023).
Chand, R., Rao, D. K., Tekabu, T. & Khan, M. G. Modeling breast cancer cases in Fiji. In 2018 5th Asia-Pacific World Congress on Computer Science and Engineering (APWC on CSE) 283–290 (IEEE, 2018).
Lin, H., Shi, L., Zhang, J., Zhang, J. & Zhang, C. Epidemiological characteristics and forecasting incidence for patients with breast cancer in Shantou, Southern China: 2006–2017. Cancer Med. 10(8), 2904–2913 (2021).
Soltani, M., Farahmand, M. & Pourghaderi, A. R. Machine learning-based demand forecasting in cancer palliative care home hospitalization. J. Biomed. Inform. 130, 104075 (2022).
Alrobai, A. & Jilani, M. Cancer incidence prediction using a hybrid model of wavelet transform and LSTM networks. In Advances in Data Science, Cyber Security and IT Applications: First International Conference on Computing, ICC 2019, Riyadh, Saudi Arabia, December 10–12, 2019, Proceedings, Part I 1 224–235 (Springer, 2019).
Lazam, N. M., Shair, S. N., Asmuni, N. H., Jamaludin, A. & Yusri, A. A. Forecasting the incidence rates of top three cancers in Malaysia. In AIP Conference Proceedings, vol. 2500, 020052 (AIP Publishing LLC, 2023).
Tudor, C. A novel approach to modeling and forecasting cancer incidence and mortality rates through web queries and automated forecasting algorithms: Evidence from Romania. Biology 11(6), 857 (2022).
Yasmeen, F. & Zaheer, S. Functional time series models to estimate future age-specific breast cancer incidence rates for women in Karachi, Pakistan. J. Health Sci. 2(5), 213–21 (2014).
Xie, L. Time series analysis and prediction on cancer incidence rates. J. Med. Discov. 2(3), 1–10 (2017).
Dalabanjan, M. S. & Agrawal, P. Forecasting age adjusted rates of lung cancer in Mumbai by fitting ARIMA models. In ICDSMLA 2020: Proceedings of the 2nd International Conference on Data Science, Machine Learning and Applications, 1181–1194 (Springer, 2022).
Instituto Nacional de Câncer José Alencar Gomes da Silva/ Ministério da Saúde: Atlas On-line de Mortalidade. Accessed 7 July 2023 https://www.inca.gov.br/MortalidadeWeb/pages/Modelo10/consultar.xhtml;jsessionid=289C9A6D91A1BFCEA8FDD2CDAE2A81A7 (2023)
Instituto Brasileiro de Geografia e Estatística - IBGE: Population Census. https://www.ibge.gov.br/en/statistics/social/labor/22836-2022-census-3.html, Brazil. [Online; accessed 7-July-2023] (2023)
Black, R., Bray, F., Ferlay, J. & Parkin, D. Cancer incidence and mortality in the European union: Cancer registry data and estimates of national incidence for 1990. Eur. J. Cancer 33(7), 1075–1107 (1997).
Ferlay, J. et al. Cancer incidence and mortality patterns in Europe: Estimates for 40 countries in 2012. Eur. J. Cancer 49(6), 1374–1403 (2013).
Ferlay, J. et al. Estimating the global cancer incidence and mortality in 2018: Globocan sources and methods. Int. J. Cancer 144(8), 1941–1953 (2019).
Hyndman, R. J. & Khandakar, Y. Automatic time series forecasting: The forecast package for r. J. Stat. Softw. 27, 1–22 (2008).
Petris, G. An r package for dynamic linear models. J. Stat. Softw. 36, 1–16 (2010).
Kourentzes, N. Nnfor: Time Series Forecasting with Neural Networks (2022). R package version 0.9.8. https://CRAN.R-project.org/package=nnfor
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2022). https://www.R-project.org/
Kotu, V. & Deshpande, B. Time series forecasting. Data Science 395–445 (Elsevier, 2019).
Haykin, S. Kalman Filtering and Neural Networks Vol. 47 (Wiley, 2004).
De Livera, A. M., Hyndman, R. J. & Snyder, R. D. Forecasting time series with complex seasonal patterns using exponential smoothing. J. Am. Stat. Assoc. 106(496), 1513–1527 (2011).
Kourentzes, N., Barrow, D. K. & Crone, S. F. Neural network ensemble operators for time series forecasting. Expert Syst. Appl. 41(9), 4235–4244 (2014).
Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52(3/4), 591–611 (1965).
Box, G. E. & Pierce, D. A. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Am. Stat. Assoc. 65(332), 1509–1526 (1970).
Pagano, M. Estimation of models of autoregressive signal plus white noise. Ann. Stat. 2, 99–108 (1974).
Ljung, G. M. & Box, G. E. On a measure of lack of fit in time series models. Biometrika 65(2), 297–303 (1978).
Bagchi, P., Characiejus, V. & Dette, H. A simple test for white noise in functional time series. J. Time Ser. Anal. 39(1), 54–74 (2018).
Chen, Y.-C. et al. Forecast of a future leveling of the incidence trends of female breast cancer in Taiwan: An age-period-cohort analysis. Sci. Rep. 12(1), 12481 (2022).
Hsiao, B.-Y. et al. Ensemble forecasting of a continuously decreasing trend in bladder cancer incidence in Taiwan. Sci. Rep. 11(1), 8373 (2021).
Guo, H. et al. Time series study on the effects of daily average temperature on the mortality from respiratory diseases and circulatory diseases: A case study in Mianyang city. BMC Public Health 22(1), 1001 (2022).
Lu, L. et al. Time series analysis of dengue fever and weather in Guangzhou, China. BMC Public Health 9, 1–5 (2009).
Reyes-Urueña, J. M., Olalla, P. G. D., Perez-Hoyos, S. & Caylà, J. A. Time series analysis comparing mandatory and voluntary notification of newly diagnosed hiv infections in a city with a concentrated epidemic. BMC Public Health 13(1), 1–8 (2013).
Yokoyama, S. et al. Day-to-day regularity and diurnal switching of physical activity reduce depression-related behaviors: A time-series analysis of wearable device data. BMC Public Health 23(1), 1–9 (2023).
Sowe, A., Namatovu, F., Cham, B. & Gustafsson, P. E. Impact of a performance monitoring intervention on the timeliness of hepatitis b birth dose vaccination in the Gambia: A controlled interrupted time series analysis. BMC Public Health 23(1), 1–11 (2023).
Zhu, G. et al. The association between ambient temperature and mortality of the coronavirus disease 2019 (covid-19) in Wuhan, china: A time-series analysis. BMC Public Health 21, 1–10 (2021).
Luo, C. et al. Long-term air pollution levels modify the relationships between short-term exposure to meteorological factors, air pollution and the incidence of hand, foot and mouth disease in children: A DLNM-based multicity time series study in Sichuan province, china. BMC Public Health 22(1), 1484 (2022).
Acknowledgements
Gratitude is expressed to the Laboratório de Engenharia e Gestão em Saúde (LEGOS/UERJ) and the Cancer Foundation (Fundação Ary Frauzino para Pesquisa e Controle do Câncer) for providing support during the APC process of this research. The authors also acknowledge support from Grant PID2022-137748OB-C31, funded by MCIN/AEI/10.13039/501100011033 and “ERDF A way of making Europe”.
Funding
The author(s) received no specific funding for this work.
Author information
Contributions
D.B.N.A., J.C. and M.O.M. conceived the presented idea. D.B.N.A. and P.G.F.C. carried out the experiment and wrote the manuscript with support from T.S. and A.M.S. T.S. and A.M.S. supervised the project. All authors provided critical feedback and helped shape the research, analysis and manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bouzon Nagem Assad, D., Gomes Ferreira da Costa, P., Spiegel, T. et al. Comparing the current short-term cancer incidence prediction models in Brazil with state-of-the-art time-series models. Sci Rep 14, 4566 (2024). https://doi.org/10.1038/s41598-024-55230-2
DOI: https://doi.org/10.1038/s41598-024-55230-2