Analysis of microarray right-censored data through fused sliced inverse regression

Sufficient dimension reduction (SDR) for a regression pursues the replacement of the original p-dimensional predictors with a lower-dimensional linear projection of them. The so-called sliced inverse regression (SIR; [5]) arguably has the longest history among SDR methodologies, and it is still one of the most popular. However, SIR is known to be easily affected by the number of slices, which is one of its critical deficits. Recently, a fused approach for SIR was proposed to relieve this weakness; it fuses the kernel matrices computed by SIR applications with various numbers of slices. In this paper, the fused SIR is applied to a large p, small n regression of high-dimensional microarray right-censored data to show its practical advantage over the usual SIR application. Through model validation, it is confirmed that the fused SIR outperforms SIR with any single number of slices under consideration.

Consider a regression of Y|X, where Y is a response and X is a p-dimensional predictor vector. SDR seeks a p × q matrix M such that

Y ⫫ X | M^T X, (1)

where ⫫ represents statistical independence and q ≤ p. The conditional independence statement (1) indicates that the two conditional distributions of Y|X and Y|M^T X are equivalent, so X can be replaced by M^T X without loss of information about Y|X. A subspace spanned by the columns of an M satisfying (1) is called a dimension reduction subspace. If the subspace acquired by intersecting all possible dimension reduction subspaces is itself a dimension reduction subspace, the intersection is defined as the central subspace S_{Y|X} [1]. The central subspace is minimal and unique, and its restoration is the main purpose of the SDR literature. Hereafter, d and η ∈ R^{p×d} denote the true dimension and an orthonormal basis matrix of S_{Y|X}, respectively. The dimension-reduced predictor η^T X is called the sufficient predictor.
Data whose sample size n is smaller than p, such as microarray and other high-throughput data, are quite common these days. In such data, the so-called curse of dimensionality usually occurs, so proper model-building is often problematic in practice. The SDR of X through S_{Y|X} can then facilitate model specification, so it turns out to be practically useful for such data.
One of the most popular SDR methods is sliced inverse regression (SIR; [2]). Implementation of SIR requires a categorization of the response variable Y, called slicing, and the selection of an appropriate number of slices is often critical to the application results. So far, no ideal or recommended guideline for choosing the number of slices is known. To overcome this, a fused approach was proposed in [3] that combines sample kernel matrices of SIR constructed with varying numbers of slices. The combining approach in [3] is called fused sliced inverse regression (FSIR). According to [3], FSIR yields basis estimates of S_{Y|X} that are robust to the number of slices.
The purpose of this paper is to analyze microarray right-censored survival data by implementing the fused sliced inverse regression (FSIR) of [3]. The performance of FSIR is compared with usual SIR applications with different numbers of slices. The organization of the paper is as follows. SIR and FSIR, along with their applicability to survival regression, are discussed in section 2. In the same section, the permutation dimension test is discussed. Diffuse large-B-cell lymphoma data are analyzed through SIR and FSIR, and their results are compared in section 3. We summarize our work in section 4.

Sliced inverse regression and its fused approach. Define the standardized predictor Z = Σ^{-1/2}(X − E(X)), where Σ = cov(X). Letting S_{Y|Z} be the central subspace for the regression of Y|Z, the relationship S_{Y|X} = Σ^{-1/2} S_{Y|Z} holds. Define η_z to be a p × d orthonormal basis matrix of S_{Y|Z}. Consider the so-called linearity condition: (C1) E(Z|η_z^T Z) is linear in η_z^T Z. According to [2], a proper subspace of S_{Y|Z} can be constructed under the linearity condition: S(E(Z|Y)) ⊆ S_{Y|Z}. For estimating S_{Y|X} completely, it is typically assumed that S(Σ^{-1} E(X|Y)) = S_{Y|X}. Sliced inverse regression is a method to recover S_{Y|X} by computing E(Z|Y).
In the population, the quantity E(Z|Y) can be computed without any specific assumptions on Y|Z. If Y is discrete with h levels, E(Z|Y = s) is the average of Z within the sth category of Y. Following this idea, if Y is continuous or many-valued, Y is transformed to a categorized response Ỹ with h levels; then E(Z|Ỹ = s) becomes the average of Z within the sth category of Ỹ for s = 1, …, h. This categorization of Y is called slicing, and it is done so that each category has equal numbers of observations. SIR then constructs the kernel matrix M_SIR(h) = cov(E(Z|Ỹ)), whose column space satisfies S(M_SIR(h)) ⊆ S_{Y|Z}. In sample, the algorithm of SIR is as follows (a code sketch is given after the list):

1. Construct Ỹ by dividing the range of Y into h non-overlapping intervals. Let n_s be the number of observations in the sth category of Ỹ for s = 1, …, h.
2. Standardize the predictors as Ẑ_i = Σ̂^{-1/2}(X_i − X̄), where X̄ and Σ̂ are the sample mean and covariance of X, and compute the slice means Z̄_s = (1/n_s) Σ_{i: Ỹ_i = s} Ẑ_i and the sample kernel matrix M̂_SIR(h) = Σ_{s=1}^{h} (n_s/n) Z̄_s Z̄_s^T.
3. Spectral-decompose M̂_SIR(h); the eigenvectors η̂_z corresponding to the first d largest eigenvalues are the estimate of an orthonormal basis for S_{Y|Z}.
4. Back-transform η̂ = Σ̂^{-1/2} η̂_z to have the estimate of a basis of S_{Y|X}.
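To make the algorithm concrete, here is a minimal Python sketch of steps 1-4. This is not the authors' code; the function and variable names are ours, and a positive-definite sample covariance is assumed.

```python
import numpy as np

def sir(X, y, n_slices=5, d=2):
    """Minimal sliced inverse regression (SIR) sketch.

    Returns an estimated basis of the central subspace S_{Y|X}.
    """
    n, p = X.shape
    # Step 2 (standardization): Z = Sigma^{-1/2} (X - mean)
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ inv_sqrt

    # Step 1 (slicing): h groups of (nearly) equal size along sorted y
    order = np.argsort(y)
    slice_ids = np.empty(n, dtype=int)
    slice_ids[order] = np.arange(n) * n_slices // n

    # Step 2 (kernel): weighted outer products of the slice means of Z
    M = np.zeros((p, p))
    for s in range(n_slices):
        Zs = Z[slice_ids == s]
        zbar = Zs.mean(axis=0)
        M += (len(Zs) / n) * np.outer(zbar, zbar)

    # Step 3: leading eigenvectors estimate a basis of S_{Y|Z}
    w, V = np.linalg.eigh(M)
    eta_z = V[:, np.argsort(w)[::-1][:d]]
    # Step 4: back-transform to the X scale
    return inv_sqrt @ eta_z
```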
As seen in practical implementations of SIR, the results may vary critically depending on the selection of h. This is discussed in [3]. Define M_FSIR(h) = (M_SIR(2), …, M_SIR(h)), where M_SIR(k) stands for the kernel matrix of SIR with k slices. Since S(M_SIR(k)) = S_{Y|Z} for k = 2, 3, …, h, we have S(M_FSIR(h)) = S_{Y|Z}. In sample, M̂_FSIR(h) is constructed by concatenating the sample kernel matrices, and its leading left singular vectors estimate a basis of S_{Y|Z} as in SIR.
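Under the same assumptions as the SIR sketch above, a hypothetical FSIR sketch simply stacks the SIR kernels for k = 2, …, h and extracts the leading left singular vectors:

```python
def fsir(X, y, h=10, d=2):
    """Minimal fused SIR sketch: fuse SIR kernels over k = 2, ..., h slices."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ inv_sqrt

    def kernel(k):
        # SIR sample kernel matrix with k slices
        order = np.argsort(y)
        ids = np.empty(n, dtype=int)
        ids[order] = np.arange(n) * k // n
        M = np.zeros((p, p))
        for s in range(k):
            zbar = Z[ids == s].mean(axis=0)
            M += ((ids == s).sum() / n) * np.outer(zbar, zbar)
        return M

    # Fused kernel: p x p kernels concatenated side by side (p x (h-1)p)
    M_fused = np.hstack([kernel(k) for k in range(2, h + 1)])
    # Leading left singular vectors span the estimate of S_{Y|Z}
    U, _, _ = np.linalg.svd(M_fused, full_matrices=False)
    return inv_sqrt @ U[:, :d]
```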
For choosing d in practice, a permutation approach is adopted for the dimension estimation. An advantage of the permutation test is that no asymptotic distribution of the test statistic Λ_m is required. The test sequentially considers H0: d = m versus H1: d > m for m = 0, 1, …; at each m, the p-value of Λ_m is computed from its permutation distribution, and the estimate of d is the first m for which H0 is not rejected.

Application to survival regression. In survival regression, the primary interest is the regression of the true survival time T given X, so the target statement is

T ⫫ X | η^T X. (2)

However, since the true survival time T cannot be completely observed due to censoring, the direct study of T|X cannot usually be done.
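The exact permutation scheme of the original reference is not reproduced here. Purely as an illustration of the idea, the following hypothetical sketch tests H0: d = 0 by permuting Y, which breaks any dependence between Y and X; the statistic used (the sum of all SIR eigenvalues, i.e., Λ_0 up to a factor of n) is our assumed choice:

```python
def perm_test_d0(X, y, n_slices=5, n_perm=200, seed=0):
    """Permutation test of H0: d = 0 (Y independent of X), illustrative only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    Z = Xc @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

    def stat(yy):
        # trace of the SIR kernel = sum of all its eigenvalues
        order = np.argsort(yy)
        ids = np.empty(n, dtype=int)
        ids[order] = np.arange(n) * n_slices // n
        t = 0.0
        for s in range(n_slices):
            zbar = Z[ids == s].mean(axis=0)
            t += ((ids == s).sum() / n) * (zbar @ zbar)
        return t

    observed = stat(y)
    # Permuting y approximates the null distribution of the statistic
    null = [stat(rng.permutation(y)) for _ in range(n_perm)]
    return np.mean([v >= observed for v in null])  # permutation p-value
```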
Instead, the data (Y_i, δ_i, X_i), i = 1, …, n, are collected as n independent and identically distributed realizations derived from (T, C, X), where Y = Tδ + C(1 − δ), C stands for a censoring time, and δ = I(C > T) is an indicator variable equal to 1 if T is observed (C > T) and 0 otherwise. This type of censoring is called right-censoring. Using (Y_i, δ_i, X_i), the regression of T|X is replaced as follows. The first step is a consideration of the regression of (T, C)|X; its construction directly implies that S_{T|X} ⊆ S_{(T,C)|X}. According to [5], the central subspace S_{(Y,δ)|X} from the bivariate regression of (Y, δ)|X is informative about S_{(T,C)|X}, because S_{(Y,δ)|X} ⊆ S_{(T,C)|X}. Since (Y, δ, X) are collected for survival analysis, the estimation of S_{(Y,δ)|X} can be done. The two regressions of T|X and (Y, δ)|X are connected in [3] under the condition: (C2) C ⫫ X | (η^T X, T). Condition (C2) is weaker than C ⫫ (T, X), which is normally assumed in survival analysis. Condition (C2) guarantees that statement (2) is equivalent to (T, C) ⫫ X | η^T X, so we have S_{(T,C)|X} = S_{T|X}. Therefore, the following relation is directly implied: S_{(Y,δ)|X} ⊆ S_{(T,C)|X} = S_{T|X}. According to [5,6], the equality would normally hold, because proper containment requires carefully balanced conditions. Then, SIR and FSIR are directly applicable, with bivariate slicing of Y and δ, to recover S_{T|X}. A similar discussion can be found in section 4.2 of [6].
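Bivariate slicing can be carried out, for instance, by slicing Y separately within each censoring status. The following hypothetical helper illustrates one such scheme; the references may use a different partition:

```python
def bivariate_slices(y, delta, h_per_status=5):
    """Assign slice labels by slicing Y within each censoring status.

    delta == 1: uncensored; delta == 0: censored. The result has up to
    2 * h_per_status slices of (nearly) equal size within each status.
    """
    ids = np.empty(len(y), dtype=int)
    for status in (0, 1):
        mask = np.where(delta == status)[0]
        order = mask[np.argsort(y[mask])]
        within = np.arange(len(mask)) * h_per_status // len(mask)
        ids[order] = within + status * h_per_status
    return ids
```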

Analysis of diffuse large-B-cell lymphoma data. The diffuse large-B-cell lymphoma dataset (DLBCL; [7]) contains measurements of 7399 genes from 240 patients obtained from customized cDNA microarrays. For each patient, the survival time was recorded, varying from 0 to 21.8 years. There are 138 uncensored cases (deceased) among the 240 patients. A more detailed description of the data is found in [6-8].
We follow the approach in [9] to analyze the DLBCL data. The data are randomly divided into a training set of 160 patients and a test set of 80. As usual, the training set is used for model-building, and the test set is utilized for model-validation. First, the 7399 genes in the training set, denoted as X_tr, are initially reduced to their first 40 principal components through principal component analysis; SIR and FSIR are then implemented with these 40 principal components as predictors to estimate η. For model-validation, the predicted scores and the corresponding areas under the ROC curve for prediction of survival from 1 to 10 years were computed for both the training and test sets. For the test set, the dimension-reduced predictors are defined as X_te^T η, where η is obtained from the training set and X_te stands for the predictors in the test set. An area closer to one indicates better estimation.
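As an illustration of the validation step, the following sketch computes a t-year AUC from a one-dimensional sufficient predictor using scikit-learn's roc_auc_score. The names and the handling of cases censored before t years are our assumptions, not necessarily the paper's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def t_year_auc(score_te, y_te, delta_te, t):
    """AUC for predicting survival beyond t years from a sufficient predictor.

    Cases censored before t have unknown t-year status and are dropped here;
    this simple handling is an assumption of this sketch.
    """
    known = (y_te >= t) | (delta_te == 1)      # status at t is determined
    label = (y_te[known] >= t).astype(int)     # 1 = survived beyond t years
    return roc_auc_score(label, score_te[known])

# Hypothetical usage: reduce genes to 40 PCs on the training set, fit a
# direction there, and project the test set with the *training* loadings.
# pca = PCA(n_components=40).fit(X_tr)
# eta = sir(pca.transform(X_tr), y_tr, n_slices=4, d=1)
# score_te = pca.transform(X_te) @ eta
# auc5 = t_year_auc(score_te.ravel(), y_te, delta_te, t=5)
```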
One potentially arguable issue in this analysis arises in the selection of the number of slices h in the SIR application. As discussed in the previous section, SIR's performance inevitably depends on h. To investigate how seriously this choice impacts the model-validation, we consider h = 2, 4, 6, 8 and 10 for SIR along with FSIR. Following the guidance of [3], 10 slices are used in FSIR. The areas under the ROC curve for the training and test sets are reported in Fig. 1.
First, consider the areas under the ROC curve for the training set in Fig. 1(a). Larger areas indicate better prediction performance. For the SIR applications, smaller numbers of slices show better performance. FSIR is not the best among all the SIR applications considered here, but there is no notable difference from the best result, obtained with h = 2. Therefore, for the training set, FSIR is not a cause for concern at all. In the case of the test set in Fig. 1(b), FSIR shows better prediction performance than any of the SIR applications. The prediction results of FSIR are consistent across both the training and test sets, while the usual SIR applications are very sensitive to the choice of h, as expected. The application of FSIR to these data is concluded to be successful.

Discussion
According to Fig. 1(a,b), the orderings of the areas under the ROC curve for the training and test sets are reversed with respect to h in the SIR applications. In the training set, smaller numbers of slices yield larger areas, while in the test set the areas for smaller h become smaller, even falling below 0.5; an area of 0.5 is often used as the cut-off. Therefore, among the SIR applications, only that with h = 10 is above 0.5 in both the training and test sets, although its performance is the worst in the training set. FSIR, however, shows reliable and consistently good performance in both the training and test sets.
Conclusion
Fused sliced inverse regression (FSIR), proposed by [3], addresses the sensitivity of sliced inverse regression (SIR; [2]) to the number of slices by combining SIR kernel matrices. In this paper, FSIR is applied to high-dimensional microarray right-censored data to show its potential advantage for large p, small n data over the usual SIR application. The predictors are initially reduced through principal component analysis, and then SIR and FSIR are implemented with 40 principal components. According to the model-validation, SIR reveals its sensitivity to the number of slices. Moreover, contrasting validation results are observed between the training and test sets: for SIR, the numbers of slices with better performance in the training set show worse performance in the test set. This may be because good slicing schemes in the training set do not coincide with those in the test set. This is confirmed again through the estimation of the true structural dimension. However, FSIR shows better performance in both the training and test sets than all SIR applications under consideration. This demonstrates a practical advantage of FSIR over SIR.
The use of FSIR can improve accuracy in high-dimensional data analysis, which arises often in many scientific fields including the biological sciences, so it can contribute to new discoveries in many areas of science.