Elastic Correlation Adjusted Regression (ECAR) scores for high-dimensional variable importance measurement

Investigation of the genetic basis of traits or clinical outcomes relies heavily on identifying relevant variables in molecular data. However, characteristics of these data such as high dimensionality and complex correlation structures hinder the development of related methods, resulting in the inclusion of false positives and negatives. We developed a variable importance measure, termed the ECAR score, that evaluates the importance of each variable in the dataset. Based on this score, variables can be ranked and selected simultaneously. Unlike most current approaches, which select variables that are merely predictive, the ECAR scores aim to rank the influential variables as high as possible while maintaining the grouping property. The performance of the ECAR scores was tested and compared to other methods on simulated, semi-synthetic, and real datasets. Results showed that the ECAR scores improve on the CAR scores in terms of variable selection accuracy and the predictive power of high-rank variables. They also outperform classic methods such as the lasso and stability selection when there is a high degree of correlation among influential variables. As an application, we used the ECAR scores to analyze genes associated with forced expiratory volume in the first second in patients with lung cancer and report six associated genes.


Scientific Reports | (2021) 11:23354 | https://doi.org/10.1038/s41598-021-02706-0

studies19. Another example is stability selection20, which is based on linear models and has flavors of both the lasso and random forests. Compared with the lasso, it is better suited for variable selection, but the price to pay is reduced power to identify the true signals. CAR scores21 and CARS scores22 also fall into this group; they are easy to calculate but may not be flexible enough when the noise in the data is very small or very large. Because of these limitations of current approaches, we developed the Elastic Correlation Adjusted Regression (ECAR) score for simultaneous variable ranking and selection in high-dimensional biological data. This method extends the CAR scores21 and improves on them in terms of selection accuracy by adjusting its parameter to the characteristics of each dataset. Specifically, the ECAR scores aim to rank the truly influential variables as high as possible. To determine the final selected set, we apply the adaptive false discovery rate density approach23. We compared the performance of the ECAR scores to the lasso, stability selection, ridge, CAR scores, and Sure Independence Screening24 (SIS) on three classes of datasets: simulated datasets with a fixed correlation structure, semi-synthetic datasets generated from mRNA expression data, and real datasets from the T3/barley database. We show that ECAR improves on CAR and rivals popular methods such as the lasso in terms of variable selection accuracy and the predictive power of high-rank variables.

Results
The idea of ECAR scores. Suppose we have random variables (X_{p×1}, Y_{1×1}), where X = (X_1, ..., X_p)^T denotes the genomic features and Y is the outcome. The ECAR scores ω are defined as

ω = R^{−α} R_XY,   (1)

where R is the correlation matrix of the feature space and R_XY is the vector of Pearson correlation coefficients between the features and Y. R^{−α} is the (−α)th power of the real symmetric matrix R, obtained by first computing the spectral decomposition R = QΛQ^{−1} and then transforming the eigenvalues, R^{−α} = QΛ^{−α}Q^{−1}. When α = 0, the ECAR scores are equivalent to the Pearson correlation coefficients; when α = 1, they are equivalent to the semipartial correlation coefficients. CAR fixes the parameter α at 0.5 and therefore sits midway between the marginal method (α = 0) and the conditional method (α = 1).
We believe it is not reasonable to fix α at either 0 (Pearson correlation) or 0.5 (CAR). A small example illustrates this. Suppose X_1 ∼ N(0, 1), X_2 = X_1 + ε, X_3 ∼ N(0, 1), and Y = X_2 + 0.5X_3, where ε ∼ N(0, 1). Figure 1 shows the diagram and the contributions of the three variables. The areas of the three circles represent the variables' univariate contributions to Y, which can be seen as the coefficients of determination (R²) of three univariate regressions. The X_1 circle overlaps with the X_2 circle because X_1 is not in the data-generating model; its apparent contribution is borrowed from X_2. Generating 1000 samples from this model and setting α = 0, the ECAR scores of X_1, X_2, X_3 are 0.56, 0.81, and 0.24, respectively. Note that even though X_1 plays no role in generating Y, its score is higher than that of X_3, which is in the model. Setting α = 0.5 yields scores of 0.3, 0.75, and 0.24; X_1's score is reduced but still larger than X_3's. If we set α = 1, the three scores become −0.03, 0.83, and 0.24, whose absolute values are much more reasonable in terms of evaluating the variables' actual contributions.
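The toy model above can be reproduced in a few lines of NumPy. This is our own illustrative sketch (the paper's implementation uses R, and the function name `ecar_scores` is ours); the exact values differ slightly from those quoted above because they depend on the simulated sample.

```python
import numpy as np

def ecar_scores(X, y, alpha):
    """ECAR scores omega = R^(-alpha) R_XY, with the matrix power taken
    through the spectral decomposition of the feature correlation matrix R."""
    R = np.corrcoef(X, rowvar=False)                 # p x p correlation matrix
    r_xy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    evals, Q = np.linalg.eigh(R)                     # R = Q diag(evals) Q^T
    return (Q * evals ** -alpha) @ Q.T @ r_xy        # Q diag(evals^-alpha) Q^T R_XY

rng = np.random.default_rng(0)
n = 100_000                                          # large n for stable estimates
x1 = rng.standard_normal(n)
x2 = x1 + rng.standard_normal(n)
x3 = rng.standard_normal(n)
y = x2 + 0.5 * x3
X = np.column_stack([x1, x2, x3])

for a in (0.0, 0.5, 1.0):
    print(a, np.round(ecar_scores(X, y, a), 2))
```

As α grows from 0 to 1, X_1's score shrinks toward zero while X_3's stays near its true standardized coefficient.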
Fixing α at one is also not ideal. In this case the scores are equivalent to the multiple regression coefficients when X and Y are standardized. This seems better suited to locating the truly influential variables, since in theory the non-influential variables contribute nothing to the outcome given the other variables. However, accurate estimation of the scores may be difficult because of the high degree of correlation among genomic factors25. This problem can be alleviated by shrinkage, which means that only a limited amount of information from the correlation matrix is used; this amount can be seen as the parameter α in the ECAR scores.
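The equivalence at α = 1 between the scores and the standardized multiple regression coefficients can be checked numerically; the snippet below is an illustrative sketch, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 6
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated design
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

R = np.corrcoef(X, rowvar=False)
r_xy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
scores_alpha_1 = np.linalg.solve(R, r_xy)          # ECAR scores at alpha = 1

# Ordinary least squares on standardized X and y gives the same numbers.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
beta_std = np.linalg.lstsq(Xs, ys, rcond=None)[0]
print(np.allclose(scores_alpha_1, beta_std))       # True
```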
α should be adjusted for each dataset to achieve a reasonable compromise between the two extremes (multiple regression coefficients and Pearson correlation coefficients). As α varies from 0 to 1, the scores borrow more and more information from the correlation matrix and tend toward the multiple regression coefficients. The ECAR scores can also be seen as the correlation coefficients between the outcome Y and the transformed features R^{−α}X. The transformed features R^{−α}X become more and more unlike the original X as α increases; this idea is illustrated in Fig. 2.
The best α moves toward 1 as R² increases, a trend consistently observed in our simulations and applications. To estimate α for the ECAR scores on a given dataset, we propose a method that maximizes variable selection power, discussed in detail in the Methods section. In brief, we first estimate R² and the number of influential variables s; we then randomly select s variables and simulate a new dataset from a linear model with the same X, R², and s as the real dataset. Because the truly influential variables in the simulated dataset are known, the α with the best variable selection performance, measured by the area under the precision-recall curve (PR-AUC), can be found there. This process is repeated many times, and we take the median of the best α values as our final estimate, which performs well in general under different correlation structures. In this way, we can predict the α that ranks the influential variables as high as possible in the real problem. The method works because, although we know nothing about the truly influential variables, we can estimate the parameter by simulating plausible scenarios; the estimated α remains reasonable in the real setting as long as that setting is not too far from the simulated ones.
Comparison of feature selection accuracy on simulated datasets. We compared the performance of ECAR with CAR, SIS, ridge, lasso, and stability selection (details of these methods can be found in the Methods section) on 500 simulated datasets, each consisting of 200 observations and 600 features. The correlation matrix of the features is block diagonal with a compound symmetry structure used in previous research26: it consists of two equally sized blocks, within which the correlation between any two features is 0.25, while features from different blocks are independent. We used the method described in the Methods section to determine the best α. In the linear data-generating model (2) there are 30 influential variables, randomly selected from the first block, whose coefficients are sampled from the uniform distribution on [0, 1]; ε is normally distributed with mean 0 and variance σ².
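This simulation design can be sketched as follows (our own code, not the authors'): a block-diagonal compound-symmetry correlation matrix with two blocks, and Gaussian draws from it via a Cholesky factor.

```python
import numpy as np

def block_compound_symmetry(p, n_blocks, rho):
    """Block-diagonal correlation: rho within each block, 0 across blocks."""
    size = p // n_blocks
    block = np.full((size, size), rho)
    np.fill_diagonal(block, 1.0)
    Sigma = np.zeros((p, p))
    for k in range(n_blocks):
        Sigma[k * size:(k + 1) * size, k * size:(k + 1) * size] = block
    return Sigma

rng = np.random.default_rng(2)
n, p = 200, 600
Sigma = block_compound_symmetry(p, n_blocks=2, rho=0.25)
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T  # N(0, Sigma) draws
```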
We adjusted σ to synthesize five groups of datasets, each consisting of 100 datasets and achieving a different level of R². Using the method described in the Methods section to estimate α, we found that when R² is controlled at 0.2, 0.4, 0.6, 0.8, and 0.95, the medians and standard deviations of the 100 best α values are 0.225 ± 0.22, 0.350 ± 0.17, 0.450 ± 0.13, 0.600 ± 0.10, and 0.750 ± 0.08, respectively. These medians were substituted into the ECAR scores for comparison with the other methods on these datasets. The true positive paths, as well as the medians and standard deviations of PR-AUC, are shown in Figs. 3 and 4 (a truncated version of Fig. 3) for each method. Under all R² settings, the path of ECAR is among or near the highest paths. The results also demonstrate ECAR's flexibility: as R² drops, SIS outperforms the other methods and extends its lead; meanwhile, the α in ECAR decreases, so ECAR behaves more like SIS.
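Adjusting σ to hit a prescribed R² uses the noise-variance formula given in the Methods section, σ² = (1 − R²)βᵀXᵀXβ / (nR²). A minimal sketch (our own illustration, with assumed dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, target_r2 = 5000, 50, 0.6
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = rng.uniform(0.0, 1.0, 10)         # 10 influential variables

signal = X @ beta
# Noise variance that makes the linear model attain the prescribed R^2:
# sigma^2 = (1 - R^2) * beta' X' X beta / (n * R^2)
sigma2 = (1.0 - target_r2) * (signal @ signal) / (n * target_r2)
y = signal + rng.normal(0.0, np.sqrt(sigma2), n)

# Empirical R^2 of the generating model should be close to the target.
r2 = 1.0 - ((y - signal) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 3))                            # close to 0.6
```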
Since having all influential genes in the first block seems too advantageous for SIS, we also selected the influential genes randomly from the whole set. Figure 5 shows that ECAR consistently outperforms SIS, ridge, and stability selection and is highly competitive with the lasso except when the noise is extremely low (R² = 0.95). The paths of ECAR and CAR are very similar in this figure, indicating that the result is not very sensitive to α in this setting. We also performed a sensitivity analysis in which different coefficient distributions were used to estimate α. The results of ECAR may change somewhat when the coefficients of the features in the test sets are generated from an entirely different distribution than the one (uniform) used to estimate α. For example, when those coefficients are sampled from the standard normal distribution, the performance of lasso and stability selection is greatly enhanced, while ECAR, CAR, and ridge select fewer true positives. This decline is due to their grouping property (positively correlated features tend to receive similar scores): if positively correlated influential features have coefficients of different signs, their scores tend to cancel each other out, undermining the performance of these methods. However, the effect is not large, as can be seen in Supplementary Table S2.

Comparison of feature selection accuracy on semi-synthetic datasets. The outcomes were generated from the linear model

Y = Xβ + ε.   (2)

We removed the genes with zero expression in more than 20% of the samples. For each expression dataset, the 1000 genes with the largest variance were selected, log-transformed, and normalized to form the simulation dataset. The sample sizes of the simulation datasets are 512, 369, and 497, respectively. The comparison method is the same as in the previous section.
Figure 6 summarizes the feature selection performance of each method on the LUAD dataset (results on LIHC and LUSC are similar). In this experiment, 30 influential genes were randomly selected from the first 300 genes (the first block). The parameter α of ECAR was estimated using the method described in the Methods section, and the regularization parameters of ridge and lasso were estimated using fivefold cross-validation and generalized cross-validation, respectively. As lasso cannot select more variables than the sample size, we let it choose genes randomly once all genes in its selected set are exhausted. Figure 6 looks similar to Fig. 3: ECAR still outperforms CAR when R² is above 0.6 or below 0.4, and it consistently selects more true positives than SIS and stability selection. Compared with the more artificial examples in the previous section, lasso performs better, as its advantage over ECAR is maintained until R² drops to 0.6. When the noise is high (R² = 0.2), all methods behave similarly. These results further demonstrate that when the features have a general correlation structure, ECAR still improves on CAR and is competitive with classic variable selection methods. We also performed several sensitivity analyses in which we varied the distribution of the coefficients and the number of influential genes; the results are quite similar to those shown here (Supplementary Figs. S6–S8). Interestingly, the best α seems to be insensitive to the dimension: when we increased the dimension to 4000, the best α under different R² levels remained almost the same.

Prediction performance on real datasets. We applied ECAR, CAR, lasso, stability selection, SIS, and ridge to the three barley datasets. Since the truly influential SNPs are unknown, we instead compared the mean squared error (MSE) on the test set, indexed by the number of SNPs in the model.
For the spike length, lodging degree, and leaf width datasets, R² is estimated to be 0.4, 0.32, and 0.45; the numbers of influential variables are estimated to be 326, 152, and 145; and the parameter α is estimated to be 0.45 ± 0.03, 0.4 ± 0.07, and 0.5 ± 0.06, respectively.
Tables 1 and 2 summarize the generalization performance of high-rank SNP features evaluated by lasso and ridge on the three datasets, assessed by the MSE of lasso or ridge on the test sets. While none of the methods performs consistently well in every case, SNPs ranked by the ECAR scores have relatively high prediction accuracy overall. The MSE of ECAR is lower than that of CAR, lasso, and ridge in all cases except on the leaf width dataset, where ECAR is equivalent to CAR (α = 0.5). Among the multivariate methods, stability selection performs quite well in terms of prediction error.

ECAR applied to the LUAD dataset. We analyzed forced expiratory volume in 1 s (FEV1) in 230 patients from the TCGA LUAD cohort (total n = 577). In the preprocessing step, 3781 genes with zero values in more than 20% of the samples were removed, resulting in a dataset of 230 samples and 16,750 genes. After logarithmic transformation of both the gene expressions and FEV1, we applied ECAR to the data. R² and the number of influential variables were estimated to be 0.1 and 9, respectively, and the α estimates had a median of 0.15 and a standard deviation of 0.28. Controlling the false discovery rate at 5%, ECAR selected six genes (CHRM3, CTCFL, KCNE2, MLANA, MSMP, TTLL2), many of which have been reported to be associated with lung function or cancer. For example, CHRM3 encodes the muscarinic acetylcholine receptor M3, a well-characterized drug target with many approved drugs, including treatments for asthma and obstructive lung disease28. BORIS, encoded by CTCFL, is expressed at moderate to high levels in lung carcinoma cell lines29. KCNE2 and TTLL2 might be associated with pulmonary function30,31.
MSMP hindered the effect of anti-VEGF therapy32, and it can promote xenograft PC3 tumor growth and reduce survival in a PC3 metastatic mouse model33. The genes selected by SIS and CAR are listed in Table 3.

Discussion
We developed the ECAR scores, which simultaneously measure the importance of all variables in a regression model. We showed that by searching for the parameter that maximizes PR-AUC, ECAR can improve on traditional methods such as SIS, ridge, and CAR in terms of feature selection accuracy while maintaining strong predictive power in the high-rank features. ECAR is also highly competitive with popular variable selection methods such as the lasso, notably when the influential factors are correlated. Moreover, it enjoys the grouping property: strongly correlated variables tend to be selected together. Another advantage of ECAR is that its parameter is insensitive to the sampling setting: even if the distribution of coefficients used differs from the real one, the results are not significantly different. The flexibility of ECAR not only enables it to perform well under settings that are unsuitable for CAR, but also ensures approximately equal performance when CAR works satisfactorily. Some researchers applied the CAR scores to SNP selection in the GAW17 dataset and found that CAR performed much better than other classic methods34. We estimated α on this dataset, and the best α turned out to be very close to 0.5, which explains why CAR can outperform those classic methods on these data.

Table 1. Summary of the generalization performance of high-rank SNP features evaluated by lasso on the three datasets. Base lasso is the prediction performance of lasso on the test sets using all features as input. See the Methods section for further details.
One issue with ECAR is that estimating the parameter α is computationally intensive, since the power of the correlation matrix must be computed 21 times. This process can be accelerated using the methods proposed by Strimmer21, which yield substantial computational savings when the sample size is much smaller than the number of features. Another issue is that ECAR is not completely automatic: parameters such as R² and the number of influential variables must be estimated. The estimation accuracy affects the result, but fortunately the effect is limited. Finally, when the data have a small R² or few influential variables, the standard deviation of α may be quite large, and it may be safer to apply the more conservative SIS in that case.

Methods

Data.
To evaluate the performance of the ECAR scores, we used three kinds of datasets. The first kind has 200 samples and 600 features split into two equal-sized groups; these datasets were generated from a multivariate normal distribution with mean zero, where the correlation between any two features in the same group is 0.25 and features in different groups are independent. The outcomes were generated by a Gaussian linear model in which the influential features are randomly selected. The second kind are semi-synthetic datasets, so called because the features come from real mRNA expression data obtained from three TCGA cancer projects (LUAD, LIHC, LUSC); the outcomes were generated in the same way as described above. We compared variable selection accuracy on the datasets above. We also used three barley datasets downloaded from the T3/barley database, with sample sizes ranging from 712 to 1947 and feature sizes from 6236 to 6583, to evaluate the prediction performance of ECAR. Finally, we applied ECAR to the LUAD dataset to analyze genes that influence forced expiratory volume in 1 s (FEV1). Information on the datasets is given in Supplementary Table S1.

Details of the ECAR scores.
Our method takes genomic features and the outcome as input and returns scores that represent the importance of the features. In the ECAR scores ω = R^{−α}R_XY (Eq. 1), the calculation of the Pearson correlation coefficients R_XY is straightforward. For R^{−α}, we need to estimate α from the data, after which the function powcor.shrink from the R package corpcor can be used to calculate R^{−α}. The procedure for estimating α is as follows. First, to simplify the computation, we limit the choice of α to an equally spaced sequence of 21 values between 0 and 1 (0, 0.05, 0.1, ..., 1). Second, we simulate 100 datasets using a Gaussian linear model Y = Xβ + ε, where ε ∼ N_n(0, σ²I_n). In this model X is given, and β and σ² must be predetermined to generate Y. If R² and the number of influential variables s are approximately known, we randomly select s variables whose corresponding β are sampled from the uniform distribution on [0, 1] (the remaining β are 0); σ² is set to (1 − R²)βᵀXᵀXβ / (nR²) so that the model achieves the predetermined R². If R² and s are not given, they can be estimated from the data: many methods can estimate R², refitted cross-validation35 being one example, and the number of influential features s can be estimated from the cardinality of the lasso's selected set, with the lasso's regularization parameter estimated by generalized cross-validation to avoid numerical instability. Our sensitivity analysis showed that the result is not very sensitive to these estimates. To ensure that R is positive definite, we use the shrinkage approach proposed by Strimmer et al.36, implemented in the R package corpcor. After generating the simulated datasets, the PR-AUC (area under the precision-recall curve) is computed at each value of α, and for each dataset the α that maximizes PR-AUC is selected. We take the median of these α values as the estimate of α.
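The estimation loop above can be sketched as follows. This is a simplified stand-in for the authors' R implementation (PR-AUC is computed as average precision, the function names are ours, and the corpcor shrinkage step is omitted).

```python
import numpy as np

def average_precision(scores, truth):
    """Area under the precision-recall curve, computed as average precision."""
    order = np.argsort(-np.abs(scores))
    hits = truth[order].astype(float)
    prec_at_k = np.cumsum(hits) / (np.arange(hits.size) + 1)
    return (prec_at_k * hits).sum() / hits.sum()

def estimate_alpha(X, r2, s, n_sim=20, seed=0):
    """Median over simulated datasets of the alpha (on a 21-point grid)
    that maximizes PR-AUC; the simulations share X, R^2 and sparsity s."""
    grid = np.linspace(0.0, 1.0, 21)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    evals, Q = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    best = []
    for _ in range(n_sim):
        beta = np.zeros(p)
        beta[rng.choice(p, s, replace=False)] = rng.uniform(0.0, 1.0, s)
        signal = X @ beta
        sigma2 = (1.0 - r2) * (signal @ signal) / (n * r2)
        y = signal + rng.normal(0.0, np.sqrt(sigma2), n)
        r_xy = Xs.T @ ((y - y.mean()) / y.std()) / n   # marginal correlations
        truth = beta != 0
        aucs = [average_precision((Q * evals ** -a) @ (Q.T @ r_xy), truth)
                for a in grid]
        best.append(grid[int(np.argmax(aucs))])
    return float(np.median(best))

rng = np.random.default_rng(4)
Z = rng.standard_normal((300, 1))
X_demo = 0.5 * Z + rng.standard_normal((300, 60))      # mildly correlated design
alpha_hat = estimate_alpha(X_demo, r2=0.6, s=10)
print(alpha_hat)
```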
To obtain a cutoff for the scores and achieve false discovery control, we apply the adaptive false discovery rate density approach23 (function fdrtool from the R package fdrtool).
Evaluation measures. The performance of ECAR was evaluated in terms of feature selection and prediction accuracy. ECAR assigns a score to each feature in the dataset, and the features are then ranked and selected in descending order of the absolute value of their scores. To evaluate ECAR's performance on simulated datasets whose truly influential features are known, we used PR-AUC, the area under the precision-recall curve created by plotting precision against recall at various threshold settings. Precision is the number of true positive factors divided by the number of selected factors, while recall is the fraction of true positive factors that are retrieved. We also plotted the number of true positive factors against the total number of selected factors. For the real data, we examined the prediction performance of the selected SNP features: the data were randomly split into a training set containing 2/3 of the samples and a test set containing the rest.

Table 3. Summary of the genes selected by each method.
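A tiny worked example of these measures (our own illustration): for four ranked features of which the first and third are true positives, the precision at each rank and the resulting PR-AUC (average precision) can be computed by hand and checked.

```python
import numpy as np

scores = np.array([0.9, 0.5, 0.4, 0.1])        # importance scores, in rank order
truth = np.array([True, False, True, False])   # which features are influential

order = np.argsort(-np.abs(scores))
hits = truth[order].astype(float)
precision_at_k = np.cumsum(hits) / (np.arange(hits.size) + 1)
recall_at_k = np.cumsum(hits) / truth.sum()
pr_auc = (precision_at_k * hits).sum() / truth.sum()
print(precision_at_k)   # [1.  0.5  0.667  0.5] (rounded)
print(pr_auc)           # (1/1 + 2/3) / 2 = 0.8333...
```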

Sensitivity analysis.
To test the stability of α under different sampling settings, we performed a sensitivity analysis in which three distributions of coefficients (uniform, normal, and folded normal) were used as the real distribution, while in all cases only the uniform distribution was used to estimate α. For the datasets mentioned above, we also studied the influence of parameter misspecification on feature selection performance (Supplementary Figs. S1–S8 and Supplementary Table S2).
Comparison with other methods. ECAR was compared with CAR, SIS, ridge, lasso, stability selection, and random selection (features ranked randomly, denoted RND) in terms of feature selection accuracy and the generalization performance of high-rank features. The variables are ranked by the scores produced by each method. Sure Independence Screening (SIS) is a univariate correlation ranking method that ranks features by their marginal correlation with the response. It is helpful in ultra-high-dimensional settings for screening out irrelevant variables, but it can produce a high rate of false positives in importance ranking problems in moderately high-dimensional settings; in our study its scores are the absolute values of the Pearson correlation coefficients R_XY. Lasso and ridge are the special cases q = 1 and q = 2 of the bridge estimators β̂ = argmin_{β∈R^p} (1/n)‖y − Xβ‖² + λ‖β‖_q^q, where X is the n × p design matrix, y is the n × 1 response vector, and λ is the regularization parameter. When used for feature ranking, a two-stage variable selection (TVS) technique is applied: the first stage computes the bridge estimator, and the second stage thresholds this estimate to rank the predictors. A critical difference between the two methods is that lasso produces zero regression coefficients and hence a sparse solution. A well-known problem with lasso is that it tends to select only one variable from a group of highly correlated genomic factors; it also cannot select more variables than the sample size. For ridge and lasso in this study, the regularization parameters are estimated by fivefold cross-validation and generalized cross-validation, respectively. Stability selection ranks each variable by the probability of its being selected by a base selection method such as lasso. It is better suited for variable selection, but the price to pay is reduced power to identify the true signals.
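The contrast between ridge's grouping behavior and lasso's one-per-group behavior can be seen with a duplicated feature. The coordinate-descent lasso below is our own minimal implementation for illustration, not the solver used in the paper.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]                  # remove j's contribution
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
            resid -= X[:, j] * b[j]                  # add back the update
    return b

rng = np.random.default_rng(5)
n = 200
x1 = rng.standard_normal(n)
x2 = x1.copy()                                       # exactly duplicated feature
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + rng.standard_normal(n)

b_ridge = ridge(X, y, lam=10.0)                      # spreads weight over x1, x2
b_lasso = lasso_cd(X, y, lam=0.3)                    # keeps only one of x1, x2
```

With the duplicate, ridge assigns the pair nearly identical coefficients (the grouping property), whereas the lasso solution leaves one of them at zero.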
For stability selection, we first draw 100 subsamples of size n/2 without replacement and then apply lasso on each subsample; the scores are the frequencies with which each feature is selected. For the parameters, we set the expected number of false positives v to 2.5 and the cutoff π_cut to 0.7; the regularization parameter is then adjusted so that √(v·p·(2π_cut − 1)) features (p is the total number of variables) are selected in each replicate, following the result in the original paper20. Another approach used for feature ranking is the CAR scores, which are the correlations between the decorrelated variables and the response. The CAR scores are simple to calculate, R^{−0.5}R_XY, and perform well in some cases; however, this fixed choice can be too aggressive or too conservative in real problems.
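The subsample-and-count mechanics can be sketched as follows. For a self-contained illustration we swap in a simple top-q marginal-correlation selector as the base method, where the paper uses lasso; `q` plays the role of the per-replicate selection-set size, and all names are ours.

```python
import numpy as np

def stability_scores(X, y, select, n_sub=100, seed=0):
    """Selection frequency of each feature over n_sub subsamples of size n/2
    drawn without replacement (the stability selection score)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, n // 2, replace=False)
        counts[select(X[idx], y[idx])] += 1
    return counts / n_sub

def top_corr(X, y, q=8):
    """Stand-in base selector: the q features with the largest |corr(X_j, y)|."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(r))[:q]

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 50))
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.standard_normal(200)
freq = stability_scores(X, y, top_corr)              # true features are 0, 1, 2
```

Truly influential features are selected in nearly every subsample, so their frequencies approach 1, while noise features are selected only occasionally.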