A new semi-supervised learning model combined with Cox and SP-AFT models in cancer survival analysis

Gene selection is an attractive and important task in cancer survival analysis. Most existing supervised learning methods can only use the labeled biological data, while the censored data (weakly labeled data) far more than the labeled data are ignored in model building. Trying to utilize such information in the censored data, a semi-supervised learning framework (Cox-AFT model) combined with Cox proportional hazard (Cox) and accelerated failure time (AFT) model was used in cancer research, which has better performance than the single Cox or AFT model. This method, however, is easily affected by noise. To alleviate this problem, in this paper we combine the Cox-AFT model with self-paced learning (SPL) method to more effectively employ the information in the censored data in a self-learning way. SPL is a kind of reliable and stable learning mechanism, which is recently proposed for simulating the human learning process to help the AFT model automatically identify and include samples of high confidence into training, minimizing interference from high noise. Utilizing the SPL method produces two direct advantages: (1) The utilization of censored data is further promoted; (2) the noise delivered to the model is greatly decreased. The experimental results demonstrate the effectiveness of the proposed model compared to the traditional Cox-AFT model.


Few available data versus high dimensional covariates dilemma
The Cox model is one of the most widely used methods in cancer analysis which can assess patients' survival risk and classify the patients into 'high risk' and 'low risk' groups using the gene expression profile. However, the lack of enough information in the labeled dataset tends to conduct the issue of the inaccuracy of prediction. Trying to solve this dilemma, the AFT model is employed to estimate the true survival time for the censored data, and therefore more disease information in the censored data can be delivered to the Cox model, which can help Cox model to produce better predictions.

Similar phenotype disease data versus different genotype cancer dilemma
Recent research pointed out that similar phenotype cancers may be completely different diseases on the molecular genotype level 19,20 , and hence the AFT model cannot use some cancer data which have the same phenotype directly but with different molecular genotypes directly. The Cox-AFT model alleviated this dilemma by using the Cox model to classify the cancer data firstly because the cancers with different molecular genotype levels may lead to the different risks of the patients.
In the Cox-AFT model, the Cox model was used to classify the similar phenotype disease data into 'low risk' and 'high risk' subgroups, and these subgroups will be sent into the specific AFT model to get approximate estimate of survival time for the censored data. At last, these pseudo labeled censored data will be fed into the Cox model as labeled data.
Though effective to some extent, the Cox-AFT model suffers from the robust issues caused by heavy noise and even outliers. We found that many censored data always violate the constraint that the estimated survival time is supposed to be longer than the censored time. Therefore these falsely labeled samples are dismissed in this model, which restricts the full exploitation of the censored data. Furthermore, the samples satisfying the constraint may not be estimated correctly in the stage of the AFT model. Fed with such data with label noise, the Cox model may be evidently degenerated and its performance may be more or less harmed to the next training cycle.
The reliability and stability of the Cox-AFT model relies heavily on the accuracy of the AFT model. However, the single AFT model always encounters the robust issue in semi-supervised learning scenarios. In the first few iterations of the AFT model, censored samples have high chance to be wrongly labeled due to the inaccurate model parameters. Worse still, the AFT model utilizes all the labeled censored data to conduced model learning, and as a result the noisy information remains in the following iterations. Therefore the selection of samples values a lot in the training of the AFT model.
To solve the issues mentioned above, we introduce a robust learning mechanism called self-paced learning (SPL). The self-paced learning 21 was proposed based on the core idea of the curriculum learning (CL) 22 . Curriculum learning (CL) simulates the learning process of human beings and tend to learn easy samples first and then gradually include more complex samples into training process. The challenge in CL is the requirements of the prior knowledge about the sample easiness order. Compared to CL, SPL can identify the easy and hard samples adaptively according to what the model has already learned and gradually add harder samples into training. The SPL method has been used successfully in multiple machine learning tasks [23][24][25] . Moreover, 26 has proved the robust insight of SPL regime, by proving the equivalence between the optimization of SPL objective function and the majority minimization of a non-convex penalty. Hence SPL is a powerful robust learning regime to help us estimate the patients' survival time more accurately.
In this paper we introduce the SPL regime in the Cox-AFT model (Cox-SP-AFT), largely improving the model capacity in the presence of heavy noises and outliers. SPL is embedded into the AFT model and takes effect by automatically selecting samples following the "easy" to "hard" mode in the training process, which means learning samples of high confidence first and gradually considering more complex ones. This learning mechanism leads to more accurate estimation for the censored samples compared to that without SPL and brings many benefits. A comparison experiment between Cox-AFT models with and without considering SPL is shown in the Experiment section. It is verified that the Cox-SP-AFT model can select more correct disease-related genes, estimate the patients' survival time more accurately and employ more censored data, validating the superiority of our proposed semi-supervised learning model with SPL.

Method
Suppose that the dataset includes l samples consisting of complete dataset and censored dataset to study the correlations between the gene expression profile X and according survival time Y.
Cox proportional hazard model. The Cox proportional hazard model is used to classify the patients into two groups of the 'low risk' and 'high risk' , and the baseline hazard function can be expressed as: Minimizing the Cox's partial log likelihood function: In fact, some correlation coefficients β i of the i th gene may be zero in the true model, which means that not the whole covariates have effect to the prediction. Therefore the model should be able to identify the nonzero coefficients in the gene expression profile; a regularization part was added to solve this problem. So the penalized Cox model with penalty function can be expressed as: where λ is a tuning parameter and P(β) is the regularization term.
In recent years, methods with different regularization terms such as elastic net, L 1 , MCP and L 1/2 have been used in cancer survival analysis. In our semi-supervised learning model, we use the MCP regularization. This combination has good performance in sparsity and data-fitting ability. The derivative of the MCP can be expressed as: In the AFT model of our model, censored data are initially labeled using the Kaplan-Meier weight estimator because it's simple and fast. Trying to get more accurate results in a robust way, the SPL regime is integrated in the AFT model.
Self-Paced Learning. Curriculum Learning was first proposed in 16,17 , which follows the learning principle of humans. Afterwards, 16 formulates the key principle of CL as a concise optimization model through introducing a regularization term. The SPL objective function includes a weighted loss term on all samples and a general self-paced regularization can be expressed as: is a weight vector of all samples. λ is the age parameter for controlling the learning pace, and λ f v ( , ) is the self-paced regularization term imposed on the sample weight. By optimizing the weight vector V with gradually increasing age parameter, more samples can be automatically selected into the training process from easy to complex in a purely self-paced way. There are several variants to the original hard regularization , such as the linear and mixture SP regularization in 21 .

SP-AFT Model.
The SP-AFT model updates the original AFT model through additionally embedding the SPL regime, and inherits the priorities of SPL method, such as better robustness and higher accuracy. The specific objective function of SP-AFT model is given by adding weights to the censored data as well as a self-paced regularization term: where m, n are the numbers of labeled samples and censored samples, respectively.
are the model parameter, label variable, the weight term and the loss for the censored sample j j denotes the self-paced regularization term imposed on the weight term v j as well as the age parameter α. The age parameter controls the learning pace and the larger value will allow more complex samples into training. β P( ) represents the MCP regularization term on β.
Alternate Optimization Search(AOS) algorithm is adopted to optimize SP-AFT model. The detailed optimization procedure is presented below: Initialize. Some optimization variables and parameters are preset in this step. For the censored data set, the survival time of each sample is estimated with the Kaplan-Meier method. = ...
is an all-one vector of R m . λ is set to a small value to include several samples into training in the first round. Update β (t) . β will be updated by the AFT model with the MCP regularization utilizing the complete data and censored data with non-zero weight. In this implementation, loss function is adopted as follows: This loss function is derived from the constraint that the survival time must be no less than the censor time. Therefore, if the estimated survival time of a sample is less than the censor time, this sample must be falsely labeled and its loss value is positive infinity. However, if a censored sample obeys the censor condition, its loss function is square loss. Therefore, Formula (7) degenerates to the following AFT model: Where I denotes the sample set of complete data and censored data with non-zero weight − v j t ( 1) . We employ the minimax penalty here and the off-the-shelf methods to solve (9).
. The physical meaning of this step is to select confident samples from the censored dataset according v j . With this step of selecting high-confidence samples, the robustness of SP-AFT can be largely improved compared to that of the AFT model. Calculate the derivative with respect to v j of (7): Through such simple calculation, we can get the closed-form updating equation for v j : The samples with losses smaller than age parameter α will be seen as 'easy' ones and assigned as = v 1 j ; Otherwise will be signed as = v 0 j . At the first start, the weight values of the censored data are all set to 1, and we suppose that they are confident samples in the first iteration because the Kaplan-Meier estimate is a good but primary approximation to the real survival time. In the following iterations, confident samples are selected according to the loss value. A sample with loss value no more than the age parameter λ will be picked.
. In this step, we update the estimated survival time for the censored data with the learned parameter β (t) as well as the weight:

This updating formula indicates that if
, the sample x j will be assigned the newly estimated survival time. Otherwise, the estimated value will remain unchanged. Once the censored samples are pseudo estimated, the age parameter λ is enlarged to include more censored samples with larger losses into training. The iteration will stop until convergence. Cox-SP-AFT model. The work flow of our proposed semi-supervised learning model is shown in Fig. 1. In a training round of Cox-SP-AFT model, the training samples are firstly put into the Cox model penalized with MCP regularization, and the constructed model will classify the whole dataset into 'high-risk' and 'low-risk' groups. Then the two groups will be sent into their according SP-AFT models, respectively. In the SP-AFT model, the survival time of the censored data will be estimated. However, some estimated time of censored samples were less than the censored times. It is obvious that these censored samples were wrongly labeled. Thanks to the mechanism of SPL method, these samples with large losses will be automatically assigned zero weights and take no effect in the next iteration. At the terminal iteration of SP-AFT, reliable labeled samples (with non-zero weight) will be added to the labeled dataset thus updating the training set of the next Cox-SP-AFT round. The algorithm of our proposed Cox-SP-AFT model is outlined below:

Results
We designed the simulation scheme as in 27 . The simulation data were generated as following: Step 1: We set the dimension of the genes p = 2000, in which 20 corresponding coefficients of the related genes were nonzero, and the coefficients of the remaining 1980 genes were zero. The censored rate k was set 0.5; the correlation coefficients c was set 0.3. The number size n of the whole dataset was 250.
Step 2: We generate the γ γ γ … , , , Step 3: The survival time was computed as: , where U is the uniformly distributed variable, α is the shape parameter and the ω is the scale parameter.
Step 4: The censored time point was decided in random selection, and the censored time ′ y i was computed as where the y i is the true survival time, ′ y i is the observed time, X i is the gene expression profile and δ i represent the data is censored or not.
We generated 10 datasets through setting different β values of random selected genes, 200 random selected samples in each dataset were used as the training data and the remaining 50 samples were used as the test data each time, in this paper we compared five methods including three supervised learning and two semi-supervised learning models, the supervised learning methods were penalized by elastic net, lasso and MCP respectively. The difference between the two semi-supervised learning models is they contain the self-paced learning or not. Different methods in each dataset were evaluated 100 times and the average results were shown in below. Table 1 is the number of the selected correct genes obtained by five different Cox methods, three supervised learning methods: the elastic net penalized Cox model (Cox-EN), the lasso penalized Cox model (Cox-lasso) and the Cox mode with MCP (Cox-MCP), the other two methods are Cox models in semi-supervised learning models with or without self-paced learning (Semi-Cox or SP-Semi-Cox). The last row shows the average values of the results obtained by different methods. We can find the number of selected correct genes obtained by  Table 2. The number of the total selected genes obtained by different methods. Cox-lasso and Cox-MCP are nearly the same; the Cox-EN selected more correct genes than the Cox-lasso and Cox-MCP. However the performance of the semi-Cox without SPL is better than Cox-EN, and it is obviously that the SP-Semi-Cox model selected most correct genes. The numbers of the total selected genes obtained by different methods were shown in Table 2, the Cox-EN selected most genes, it means there may be many genes unrelated to disease. The results obtained by Cox-lasso and Semi-Cox are nearly the same, the SP-Semi-Cox selected less genes compared to the Semi-Cox without SPL but more than the Cox-MCP, and the supervised learning method Cox-MCP selected least genes.
The accuracy of correct gene selection obtained by different methods were shown in Fig. 2, it is obviously that the accuracy obtained by the SP-Semi-Cox is highest in the five methods. The accuracy of Semi-Cox is higher than other three supervised learning methods, and the accuracy of Cox-EN is lowest because it selected many unrelated genes. Compared the performances we can say though the Cox-MCP selected fewer genes than SP-Semi-Cox, but it cannot find more correct related genes, our SP-Semi-Cox is more efficient, the results proved our model has the strongest ability to find the cancer related genes. The survival curves obtained by different Cox methods in one dataset are shown in Fig. 3, the red line is the survival curve of high risk patients, and the green line is the survival curve of low risk patients. We find the two  survival curves obtained by the three supervised learning methods, both have some places overlap or intersect, however seeing the survival curves obtained by SP-Semi-Cox, the classification performance was best, and two curves with different colors did not intersect.
To further evaluate the accuracy of the model, we use the Concordance Index (CI) which can be determined as: where t t , i j are the survival time of the patients i and j, f(.) is the survival risk function, the values of CI are between 0 and 1, and the higher value means the higher accuracy the method obtained.
The average CI obtained by different methods in 10 simulation datasets are shown in Fig. 4, we can find the CI obtained by semi-supervised learning models is higher than which obtained by the three supervised learning methods, the performance of the Cox model in our semi-supervised learning model with self-paced learning is best, the Cox model in semi-supervised learning model without self-paced learning perform better than the three supervised learning methods but worse than the Cox-SP-AFT model. Table 3 shows the MSE of estimated time obtained by five different AFT methods in different models: the elastic net penalized AFT model (AFT -EN), the lasso penalized AFT model (AFT -lasso), the AFT mode with MCP (AFT -MCP), the AFT models in semi-supervised learning models (Semi-AFT) and the AFT models in         Table 9. The selected genes obtained by different methods in GSE10141. semi-supervised learning models with SPL (SP-Semi-AFT). The last row shows the average values of the MSE obtained by different methods. It is easy to find the Semi-AFT is better than the other three supervised learning AFT models, and SP-Semi-AFT model has the best performance among these five models, it means our model with SPL can predict the patients' survival time accurately. Comparing the MSE obtained by three supervised learning AFT models, the results are much close.

Discussion
In order to further evaluate the performances of different methods, these methods were applied on four gene real datasets which were collected in Gene Expression Omnibus (GEO): GSE3141, GSE10141, GSE22210, GSE26389. GSE3141 has the information about the gene expression profile and the clinically relevant associations with disease outcomes in cancer 28 . Some data about the patients who after undergoing potentially curative treatments for hepatocellular carcinoma were recorded in GSE10141 29 . GSE22210 contains the gene expression profiling of the breast cancer patients 30 . GSE26389 is the dataset which contain the gene information about the gastric cancer patients 31 . Some details about these different cancer datasets are given in Table 4. The first column is the number of the genes, the second is the number of the samples, and then is the number of labeled data in the dataset (remaining data are the censored data), the last second column is the number of training data, and the last column is the number of the test labeled data we used in the experiments. The experiments results were the average values of the 100 experiments on the corresponding datasets. The numbers of selected genes are shown in Table 5, it is easy to find no matter in which dataset, the Cox-MCP always selected the least disease related genes, the SP-Semi-Cox selected more gens than Cox-MCP but less than other three methods. Comparing the remaining three methods, the Semi-Cox selected fewer genes than the Cox-EN and Cox-lasso, the Cox-EN selected most genes in the real dataset experiments. Though the SP-Semi-Cox cannot select the least genes, its accuracy is the highest as shown in the simulation experiments; this means that researchers will be most likely to identify genes associated with the disease by using our model selected genes. The average CI obtained by different methods in different real datasets is shown in Table 6. The CI obtained by Cox-lasso is always lowest; however the gap between the three supervised learning methods is small. Compared the CI obtained by three supervised learning methods, the CI obtained by the Cox models in semi-supervised learning models were higher. We also found the performance of SP-Semi-Cox is better than Semi-Cox, it means the self-paced learning can improve the semi-supervised learning model obviously. Table 7 gives the MSE obtained by different methods in the real datasets. We get the same conclusion as in the simulation experiments: The SP-Semi-AFT has the best performance for predicting the patients' survival time, and the MSE obtained by Semi-AFT without SPL is lower than the other three supervised learning AFT models. Additionally, the performance of AFT -EN is better than the AFT-MCP, and the AFT-lasso has the highest MSE in the real data experiments.
The 10 top-ranked disease related genes selected by different Cox models in different real datasets were shown in Tables 8-11, the names in bold were the selected genes by different methods, and the genes with star(*) means these gene were only selected by SP-Semi-Cox method in Cox-SP-AFT model.
It is very obviously that there are many genes which are selected by different methods at the same time in different datasets, such as PXN in GSE3141, NTRK3 in GSE10141, VBP1 in GSE2210 and KDM5A in GSE26389. PXN encodes a cytoskeletal protein involved in actin-membrane attachment at sites of cell adhesion to the extracellular matrix, and it has been proved to be positively correlated with the clinic pathological factors of colorectal cancer 32 . NTRK3 encodes a member of the neurotrophic tyrosine receptor kinase (NTRK) family, the mutations in NTRK3 have been proved to be associated with breast carcinomas and other cancers in clinical 33 . VBP1 plays a role in the transport of the Von Hippel-Lindau protein from the perinuclear granules to the nucleus or cytoplasm, the mutation and loss of VBP1 may be related to the renal-cell carcinoma development 34 . The encoded protein of KDM5A plays a role in gene regulation through the histone code by specifically demethylation lysine 4 of histone H3, many researchers thought this gene may play a role in tumor progression 35 .
On the other hand, the Cox-SP-AFT model selected some unique genes compared other methods, BDNFOS in GSE314, CUL2 in GSE10141, AFP in GSE22210, EIF4H in GSE26389. BDNFOS is encoded as a member of the nerve growth factor family of proteins, and it plays a role in the regulation of the stress response which was  Table 11. The selected genes obtained by different methods in GSE26389.
said may be related to the lung cancer 36 . The mutational of CUL2 may play an important role in many human cancers 37 . The alpha-fetoprotein encoded by AFP is a major plasma protein which is often said to be associated with hepatoma or teratoma 38 . The encoded translation initiation factors of EIF4H can be used to stimulate the initiation of protein synthesis at the level of mRNA utilization, controlling this gene translational may make key contribution translational control in tumor promotion 39 . These genes which are mentioned in the literature demonstrated that our semi-supervised learning model can identify the real cancer related genes on the other hand.

Conclusion
In this paper we propose a new semi-supervised learning model by combining the Cox and SP-AFT models using cancer data of high dimension and low sample size. The Cox model is used to classify the cancer patients and then the SP-AFT model can robustly predict the censored data. The embedded self-paced learning regime helps our model learn from censored data in a purely self-paced manner. To conclude, our proposed Cox-SP-AFT model can utilize more censored samples and estimate their survival time with more accuracy. Therefore, the proposed semi-supervised system is supposed to achieve higher reliability and stability. Moreover, with the aid of SPL mechanism, this model will be an efficient and versatile tool to make great contributions in cancer survival analysis.