Dynamically prognosticating patients with hepatocellular carcinoma through survival paths mapping based on time-series data

Patients with hepatocellular carcinoma (HCC) always require routine surveillance and repeated treatment, which leads to accumulation of huge amount of clinical data. A predictive model utilizes the time-series data to facilitate dynamic prognosis prediction and treatment planning is warranted. Here we introduced an analytical approach, which converts the time-series data into a cascading survival map, in which each survival path bifurcates at fixed time interval depending on selected prognostic features by the Cox-based feature selection. We apply this approach in an intermediate-scale database of patients with BCLC stage B HCC and get a survival map consisting of 13 different survival paths, which is demonstrated to have superior or equal value than conventional staging systems in dynamic prognosis prediction from 3 to 12 months after initial diagnosis in derivation, internal testing, and multicentric testing cohorts. This methodology/model could facilitate dynamic prognosis prediction and treatment planning for patients with HCC in the future. Patients with hepatocellular carcinoma require regular follow-up. Here, using Cox-based feature selection to identify key prognostic features, the authors convert time-series follow-up data into a cascading survival map, and show that the approach improves dynamic prognosis prediction for patients.

H epatocellular carcinoma (HCC) is the sixth most common cancer and the third leading cause of cancer-related death worldwide 1 , characterized by multicentric origins, high risk of intrahepatic recurrence, and poor prognosis 2 . For most patients with intermediate-stage HCC, comprehensive treatments using transarterial chemoembolization (TACE), ablative therapies, surgery, or their combinations were widely adopted; besides, frequent disease surveillance and re-treatment are needed [3][4][5][6] . The current dynamic prognostication systems, including hepatoma arterial-embolization prognostic score and assessment for retreatment with TACE (ART) score were designed specifically for patients receiving arterial-embolization therapies 7,8 . A novel prognostication system that suitable for HCC patients receiving comprehensive treatments and facilitate dynamic treatment planning is desirable.
During the process of ongoing surveillance and treatment, the time-series clinical data rapidly accumulate. The data generated from HCC patients are multimodal, frequently obtained at regular interval, and whose impact may not be same in patients' life course 9,10 . Understanding these data may delineate the biological behaviors of HCC and help guide dynamic management 11 .
In the past decade, extensive studies had been conducted to understand the value of the time-series clinical data. Several studies have been reported that the joint models can describe the change in the prognostic value of a single variable measured over time 12,13 , while these models do not support multiple dynamic variables, which limit its clinical application. In the field of machine learning, the methodologies of temporal abstractions 14 and hidden Markov models 15 had been utilized for classifying patients with longitudinal data by building splitting trajectories; these models supported multiple variables and achieved better performance compared to modeling just using cross-sectional data. However, these models are considered opaque since internal structure and learned parameters are difficult for interpretation. Moreover, pure pursuit of the precision in prognosis prediction it not that important as an ideal predictive system should also provide clues in treatment planning 2 .
Patients with HCC are regularly followed up every 1-3 months during the treatment, every 3 months during the first year after complete remission, and later every 3-6 months 2 . Therefore, in this study, we converted the time-series data of patients with Barcelona clinic liver cancer (BCLC) stage B HCC into data of time slices with constant interval of 3 months, and used a process of "Cox-based feature selection" to select key prognostic features to build the first cascading survival map. The survival paths established present superior or equal ability in dynamic prognosis prediction than the conventional staging systems from the time frame of 3-12 months after diagnosis. Based on the map, we identify three paths of long-term survival; three chance nodes with progressive disease but high chance (>80%) to have improved survival after surgery/ablation; one incurable node with progressive disease, poor median overall survival (OS), and no transferred path. The methodology of survival path mapping can be utilized to facilitate dynamic prognosis prediction and treatment planning for patients with HCC in the future.

Results
The baseline characteristic of the patients. The derivation cohort consisted of 979 HCC patients with a median age of 55 (range 15-86) years, the internal testing cohort consisted of 627 HCC patients with a median age of 56 (range 20-89), and the multicenter testing cohort consisted of 414 patients with a median age of 52 (range 19-80). The multicenter testing cohort had a higher proportion of female and young patients than the derivation cohort; there were no significant differences in hepatitis B virus infection, serum alpha-fetoprotein (AFP) level, Child-Pugh class, tumor size, or number of lesions between the two testing cohorts and derivation cohort ( Table 1).
The survival paths built by derivation cohort. Using 3 months as the interval of each time slice, the first 2 years' dataset of derivation cohort was converted into the data of nine time slices, which includes 979, 822, 513, 390, 336, 294, 246, 221, and 202 All values are presented as numbers of patients followed by percentages in the parentheses. P values were calculated by comparing categorical variables between testing cohorts and derivation cohort with chi-square test cases with effective data, respectively; the significance value α was set at 0.006. After completing all the processing cycle of the derivation cohort, the data from time slice 1-9 were divided into 2, 4, 7, 10, 12, 12, 10, 10, and 7 subclasses, respectively; subclasses with <6 cases were excluded from the mapping to reduce the risk of model overfitting. By connecting the class with its derivative classes, a total of 13 survival paths were constructed, which were illustrated in different colors (Fig. 1). In this model of survival path, every bifurcation point is called a node, and each node integrates the information of previous nodes to facilitate prognosis prediction.
Prognostic value of survival path in derivation cohort. In the derivation cohort, the prognostic power between the survival path system, BCLC staging system, AJCC staging system, and ART score system was compared at all nine time slices ( Table 2). The survival path system had superior or non-inferior c-index in predicting OS than BCLC staging system, AJCC staging system, and ART score system from time slice No.3 to time slice No.9. At time slice No.2, the AJCC staging system had superior c-index than survival path system, while no significant difference in cindex between the survival path system and BCLC staging system was found. It was interesting to note that the survival differences between stage B and stage C in BCLC staging system, as well as the differences between stage IIIa, stage IIIb, and stage IVb in AJCC staging system, diminished to insignificance starting at time slice No.5; by contrast, the survival path system presented superior performance in dynamically discrimination the OS of HCC patients (Fig. 2).
Validation of the survival path system in testing cohorts. Generally, the survival curves in the internal testing cohort and multicenter testing cohort fit well with the curves in the derivation cohort (Fig. 3). In the internal testing cohort, the survival path system showed superior and non-inferior prognostic value than the BCLC staging system and AJCC staging system from time slice No.3 to time slice No.5. The advantages of the survival path system diminished starting at time slice No.6. The results in the multicenter testing cohort confirmed the advantages of the survival path system over other two staging systems from time slice No.3 to time slice No.5 ( Table 3). The significance of each path bifurcation was also evaluated in the testing cohorts (Table 4). A P value < 0.05 could be achieved in all the bifurcation with enough (≥6) cases in each of the following comparator nodes, which demonstrated the stability of the survival path system we built.
Long-term survival based on the survival paths system. Of the 13 paths constructed, 3 paths lead to long-term survival (>60 months), including No. 1    Treatment and the path transfer. Of all the nodes in the survival path system, five nodes went down from bifurcated nodes in the previous time slice and bifurcated in the following time slice. These nodes had unfavorable prognosis and the survival path system might provide guidance. Surgery and ablative therapies are considered aggressive management and therefore we described the proportion of patients receiving surgery/ablation in these nodes ( Table 5). The surgery/ablation rates in S (p = 3, ts = 2) , S (p = 5, ts = 3) , S (p = 8, ts = 4) , S (p = 2, ts = 1) , and S (p = 4, ts = 2) were 23.3%, 24.5%, 31.4%, 25.2%, and 13.2%, respectively; candidates who received surgery/ablation had rates of 83.3%, 84.6%, 81.3%, 71.2% and 56.7% going to the upper node in the next time slice, respectively. We define a node meets both following conditions: (1) median OS time of its upper bifurcated node had 10 months higher than that of the lower bifurcated node; (2) more than 80% patients receiving surgery/ablation went to the upper bifurcated node, as a chance node; then the S (p = 3, ts = 2) , S (p = 5, ts = 3) , and S (p = 8, ts = 4) are candidate nodes.
Incurable disease based on the survival paths system. Of all the paths in this map, two paths (No.10 and No.13) only have one valid node after bifurcation. We define a node without following valid node after bifurcation and had a median OS time less than 5 months as an incurable node, then S (p=10, ts=4) is the incurable node.

Discussion
In this work, we developed an analytical model to dynamically trace the prognosis of cancer patients with BCLC stage B HCC. Time slice was employed for data conversion and Cox-based feature selection for constructing the cascade structure of survival path. The survival path model showed superior or non-inferior prognostic value than the conventional BCLC and AJCC staging system from time slice No.3 to time slice No.5, which were confirmed in internal and multicenter testing cohorts. These results suggest that this tool is valuable in dynamic prognosis prediction and treatment planning for patients with intermediatestage HCC during the time frame of 3-12 months after diagnosis. Currently, studies on developing re-staging systems for malignant tumors were always hindered by the lack of effective methods in utilizing with time-series data, and by the fact that different treatments could lead to different re-staging strategies, which restrict the generalization of established models [16][17][18] . In HCC, the BCLC staging system is most widely used for staging and re-staging during clinical practice. This classification uses variables related to tumor stage, liver functional status, physical status, and cancer-related symptoms, and has been validated as the best staging system for treatment guidance 2 . However, the provided information of BCLC staging system was not enough to support dynamic prognosis prediction and real-time treatment planning, as reflected by our results that the survival differences between Stage B and Stage C subgroups gradually decreased over time. Hence we proposed to create a more precise system. The survival path we built for intermediate-stage HCC integrated the time-series information of variables utilized in BCLC classifications, variables on image change after treatment 19 , and variables on important serum markers 20,21 ; although only one selected feature was utilized for node subdivision at each time slice, the model constructed showed superior or non-inferior prognostic value than the BCLC staging system at all time slices in the derivation cohort, indicating that this methodology had great potential.
In the testing cohorts, we observed that the c-index of survival path system decreased at time slice No.6 and the advantages of the survival path system compared to BCLC and AJCC staging systems diminished starting at time slice No.6. Moreover, the cindex of BCLC staging system was significantly higher than the survival path system in the internal testing cohort at time slice No.6. These phenomena may be caused by the fact that no more path bifurcations were made since time slice No.6 due to the Table 2 Comparison of c-index between the survival path system, BCLC staging system, AJCC staging system, and ART score at each time slice in the derivation cohort  a Nodes of survival path system with less than six cases were excluded from the computing of c-index. Therefore, the number of cases in modeling is less than the number of all cases with effective data b The c-index of the interested system was lower than survival path system, with P < 0.006 c The c-index of the interested system was higher than the survival path system, with P < 0.006 restriction of the sample size, which impaired the ability of survival path system in utilizing the latest information of patients. It is conceivable that the predictive value of this system could further be enhanced if we can utilize a big clinical database of thousands of cases, as more refined paths can be constructed and the data in distant time slices can be well analyzed. The survival path model we established can also guide treatment planning for intermediate-stage HCC. In the management of patients on the paths going up, we need to closely pay attention to the key variables that could transfer the patient to unfavorable paths in the following time slices. For example, for patient in S (p = 1, ts = 2) , it is recommended to control the disease with ≤1 viable lesion or 2-3 lesions with maximal tumor size <30 mm by the end of time slice No.3. Complete remission, AFP < 400 ng/ml, and Child-Pugh score A are required to maintain the patient on the No.1 path. In dealing of patients with progressive disease, aggressive treatments like surgery or ablative therapies could be utilized for those in chance nodes, while palliative treatments were recommended for patients in incurable nodes.
The survival paths we constructed for stage B HCC was an initial attempt, and efforts could be made in several aspects to further improve this model/methodology. The first aspect is to use learning algorithms to explore the best cutoff for individual variable and optimizing the process of feature selection, including random forest 22 , k-nearest neighbor 23 , and neural networks 24 ; bootstrap validation could be utilized alongside to ensure quality control and reduce the risk of overfitting. The second aspect is to develop algorithms for node fusion, which may enhance our utilization of the database and give us an insight into the biological behavior of cancer. The third aspect is to develop methods in dealing with irregular time series. Converting the time-series data into time slices of 3 months will result in some missing data, therefore, a learning algorithm that dynamically design the interval of time slice in specific node based on the characteristics of data is needed to maximally utilize the data.
In conclusion, the survival path model constructed in this study offers a superior method for dynamic prognostication for HCC patients during the time frame of 3-12 months after diagnosis, compared with the current BCLC staging system. The methodology utilized in this study also pioneers as an effective tool in processing the clinical big data of cancer patients in the future.     (2017-FXY-129). The Hospital Ethics Committee of the four medical centers approved this study, which waived the need for written informed consent based on the retrospective nature of the study.
Most patients (1852/2020, 91.7%) in the above-mentioned medical centers received TACE as their first-line treatment; 135 (6.7%) patients received surgical resection as initial treatment and 33 (1.6%) patients refused to received treatment. The subsequent therapies after TACE, which constituted of ablative therapies, surgical resection, targeted therapies, or palliative chemotherapy, were adopted based on the decision of the multidisciplinary teams, including hepatologists, radiologists, and interventional radiologists. Patients were followed up monthly during the period of initial treatment, subsequently at every 2-3 months for the first 2 years if complete remission was achieved. The frequency gradually decreased to every 3-6 months after 2 years' remission.
Time-series data on serum tumor markers, biochemical and hematological indices, medical imaging, and associated changes (CT and/or MRI) of each patient were collected. Based on past literatures on staging systems and laboratory tests 2,25 , a total of nine variables were designed, covering imaging results, laboratory tests, and performance status; all the variables were dichotomized ( Table 6).
Transformation of datasets for analysis. To analyze the data, the time-series data were grouped into data at standard time slices. For every patient, the zero  Table 3 Comparison of c-index between the survival path system, BCLC staging system, and AJCC staging system in the internal testing cohort and multicenter testing cohort  The c-index of the interested system was higher than the survival path system, with P < 0.006 b The c-index of the interested system was lower than the survival path system, with P < 0.006  1~24 were transformed into 9 consecutive time slices. For variables measured more than once in each time slice, the newest values were selected to be associated with the time slice. The time slice with complete data of the nine variables was defined as slice with complete data. If the data in one time slice were incomplete or unavailable, but the follow-up suggested the patient was still alive, the data in this slice and subsequent slices were regarded as point with no surveillance. If the patient died or lost follow-up, the data in the following time slice were regarded as nonexistent (Fig. 4a). The primary outcome was OS. For data in the first slice, OS was defined as time from diagnosis to death by any causes. For subsequent time slices, OS was defined as time from image examination of that time slice to death by any causes.
Feature selection and construction of survival path.
Step 1: We start with using the data at first time slice of the derivation cohort, which is denoted S (all; ts = 1) (Fig. 4b). Univariate analysis with Kaplan-Meier (KM) method was utilized to identify candidate variables (X 1 , X 2 , …, X p ). In constructing the survival path, each feature selection process at a specific time slice was considered as an independent experiment. To control the false discovery rate, suppose we have m time slices, the preselected significance level for feature selection in each path was calculated by the formula below 26 : Step 2: Judgment: The sample size required for the Cox proportional hazard regression model with multiple covariates was calculated 27 . If the sample size is larger than the calculated one, the data would proceed to feature selection. Less than calculated sample size will stop any future feature selection process and the group will remain the current classification.
Step 3: Feature selection: All the significant variables detected using the KM method were put into the Cox proportional hazard regression model, which  assumes the hazard as follows, where (x 1 , x 2 ,…, x p ) is a vector of p predictor variables, and β 1 , β 2 ,…, β p are the corresponding regression coefficients, which are the weights given to each variable by the model. The original model included all the candidate variables and was presented as follows: The backward elimination (BE) procedure was carried out, with the following p tests, H 0j : β j ¼ 0; j ¼ 1; 2; ; p, the lowest partial F-test value F l corresponding to H 0l : β l ¼ 0 is compared with the preselected significance values F 0 . If F l < F 0 , then F l can be deleted and the new original model is: Then, a stepwise BE procedure was continued, until all F l > F 0 , and the model is what we choose. The importance of each variable in the fixed Cox model can be obtained as follows: where L h refers to the likelihood of the fixed model and L h − q refers to the likelihood of model after elimination of the variable X q . The variable eliminated from the model with the maximal change of −2log likelihood was selected.
Step 4: Based on the selected dichotomized variable X q , the cohort S (all; ts = 1) can be divided into two subgroups S (−q; ts = 1) and S (+q; ts = 1) . The data of the two subgroups at the next time slice, i.e., the S (−q; ts = 2) and S (+q; ts = 2) will repeat steps 1-3, respectively (Fig. 4b). If there is no variable selected, the cohort stays the current classification and the data in next time slice will repeat step 1-3.
Step 5: Graphic representation: The survival path was constructed and visualized using two-dimensional graph, with the time slices on x-axis and median OS time on y-axis.
Statistical analysis. Pearson χ 2 test was used to compare categorical variables between groups. To compare the efficacy in dynamic prognosis prediction between the survival path method and conventional staging systems, the measurement of cindex in each time slice was computed and compared using Z test method. All analyses were performed using SPSS version 20.0 (IBM Corporation, USA) and R version 3.3.2 (The R Foundation for Statistical Computing, 2016).
Data availability. All the relevant raw data that support the findings of this study have been deposited in the Dryad Digital Repository (https://datadryad.org//) datasets (doi:10.5061/dryad.pd44k8r). In addition, the authenticity of this data has also been validated by uploading the critical raw data onto the Research Data Deposit public platform (www.researchdata.org.cn), with the approval RDD number as RDDA2018000603.