An adaptive shortest-solution guided decimation approach to sparse high-dimensional linear regression

High-dimensional linear regression model is the most popular statistical model for high-dimensional data, but it is quite a challenging task to achieve a sparse set of regression coefficients. In this paper, we propose a simple heuristic algorithm to construct sparse high-dimensional linear regression models, which is adapted from the shortest-solution guided decimation algorithm and is referred to as ASSD. This algorithm constructs the support of regression coefficients under the guidance of the shortest least-squares solution of the recursively decimated linear models, and it applies an early-stopping criterion and a second-stage thresholding procedure to refine this support. Our extensive numerical results demonstrate that ASSD outperforms LASSO, adaptive LASSO, vector approximate message passing, and two other representative greedy algorithms in solution accuracy and robustness. ASSD is especially suitable for linear regression problems with highly correlated measurement matrices encountered in real-world applications.

www.nature.com/scientificreports/ to determine the individual contributions of the chosen predictors.Sparse high-dimensional linear regression has also been studied from the angle of compressed sensing 2 .
In principle, the regression coefficients can be specified by searching for the solution with the least number of nonzero elements, but this non-convex l 0 minimization problem is intractable in practice.Over the years a variety of approaches have been proposed to approximate the optimal l 0 solution.The existing approaches can roughly be divided into three categories: relaxation methods, physics-inspired message-passing methods, and greedy methods.The basic idea of the relaxation methods is to replace the non-smooth l 0 -norm penalty with a smooth approximation.Among them the Least Absolute Shrinkage and Selection Operator (LASSO) 3,4 , which uses the l 1 -norm penalty, is the most popular one.LASSO is a convex optimization problem, which can be solved by methods such as LARS [5][6][7] , coordinate descent 8 and proximal gradient descent 9 .However, due to the over-shrinking of large coefficients, LASSO is known to lead to biased estimates.To remedy this problem, some alternative methods have been proposed, including multi-stages methods such as adaptive LASSO 10 and the three-stage method 11 , and non-convex penalties such as the smoothly clipped absolute deviation (SCAD) penalty 12 and the minimax concave penalty (MCP) 13 .
An alternative strategy comes from the approximate message-passing (AMP) methods, which are closely related to the Thouless-Anderson-Palmer equation in statistical physics that is capable of dealing with highdimensional inference problems.They have shown remarkable success in sparse regression and compressed sensing [14][15][16][17] .However, the convergence issue limits the practical application of the AMP methods, especially on problems with highly correlated predictors.Recently, several algorithms such as Generalized AMP (GAMP) 18 , SwAMP 19 , adaptive damping 20 , mean removal 20 and direct free-energy minimization 21 were proposed to fix this problem.Especially, the orthogonal or vector AMP (VAMP) algorithm 22,23 offers a robust alternative to the conventional AMP.
Another line of research focuses on greedy methods for l 0 minimization such as orthogonal least squares (OLS) 24 and orthogonal matching pursuit (OMP) 25 .The main idea is to select a single variable vector that has the largest magnitude of (rescaled) inner product with the current residual response vector at each iteration step.A sure-independence-screening (SIS) method based on correlation learning was proposed to improve variable selection 26 , and an iterative version of this SIS approach (ISIS) could be adopted to enhance the performance of variable selection 27 .Several more recently developed greedy methods proposed to select several variables at a time, including the iterate hard thresholding (IHT) algorithm 28,29 , the primal-dual active set (PDAS) methods 30 , and the adaptive support detection and root finding (ASDAR) approach 31 .
Most of the above-mentioned approximate methods generally assume that the measurement matrix satisfies some regularity conditions such as the irrepresentable condition and the sparse Riesz condition, for mathematical convenience or good algorithmic performance.Roughly speaking, these conditions require that the predictors should be fully uncorrelated or only weakly correlated.But these strict conditions are often not met in real-world applications.As such, it is desirable to develop an efficient and robust method applicable for more general correlation structures of the predictors.Recently, the shortest-solution guided decimation (SSD) algorithm 32 is proposed as a greedy method for solving high-dimensional linear regression.Similar to OLS and OMP, at each iteration step SSD selects a single variable as a candidate predictor.The difference is that this selection is based on the dense least-squares (i.e., shortest Euclidean length) solution of the decimated linear equations.Initial simulation results demonstrated that this SSD algorithm significantly outperforms several of the most popular algorithms ( l 1 -based penalty methods, OLS, OMP, and AMP) when the measurement matrices are highly correlated.
Although the SSD algorithm is highly competitive to other heuristic algorithms both for uncorrelated and correlated measurement matrices, a crucial assumption in its original implementation is that there is no measurement noise ( ε = 0 ).As we will demonstrate later, when the measurement noise is no longer negligible, the naive noise-free SSD algorithm fails to extract the sparse solution of linear regression.To overcome this difficulty, here we extend the SSD algorithm and propose the adaptive SSD algorithm (ASSD) to estimate the sparse highdimensional regression models.Compared with the original SSD, the new ASSD algorithm adopts a much more relaxed termination condition to allow early stop.Furthermore and significantly, we add a second-stage screening to single out the truly important predictors after the first-stage estimation is completed.
We test the performance of ASSD both on synthetic data (predictors and responses are both simulated) and on semi-synthetic data (real predictors but simulated responses, using the gene expression data from cancer samples).In comparison with the representative algorithms LASSO, adaptive LASSO (ALASSO), VAMP, and two greedy methods (ASDAR and SIS-LASSO), our extensive simulation results demonstrate that ASSD outperforms all these competing algorithms in terms of accuracy and robustness of variables selection and coefficients estimation.It appears that ASSD is especially suitable for linear regression problems with highly correlated measurement matrices encountered in real-world applications.On the other hand, ASSD is generally slower than these other algorithms, pointing to a direction of further improvement.It may also be interesting to analyze theoretically the SSD and ASSD algorithms.

Methods
The shortest solution as a guidance vector.We first briefly summarize the key ideas behind the SSD algorithm 32 .Consider the singular value decomposition (SVD) of the measurement matrix X : X = U DV ⊤ , where U = (u 1 , . . ., u n ) is an n × n orthogonal matrix, and V = (v 1 , . . ., v p ) is a p × p orthogonal matrix, and D is an n × p diagonal matrix of the singular values 1 ≥ 2 ≥ . . .≥ n .Here {u 1 , . . ., u n } and {v 1 , . . ., v p } form a complete set of orthonormal basis vectors for the n-and p-dimensional real space respectively, so the vectors u i = (u 1i , . . ., u ni ) ⊤ satisfy u ⊤ i u j = δ ij , and vectors , where δ ij is the Kronecker symbol: δ ij = 0 for i = j and δ ij = 1 for i = j .We can express the true coefficient vector β 0 as a linear combination of the basis vectors v j : www.nature.com/scientificreports/Substituting the above expression into the regression function E(y|X) = Xβ 0 of model (Eq. 1),we obtain that with the parameter c i for i = 1, 2, . . ., n being Here �(x) is the Heaviside function: �(x) = 1 for x > 0 and �(x) = 0 for x ≤ 0 .We define a vector γ as Then We call γ the guidance vector 32 .This vector γ is dense and it is not the true coefficient vector β 0 we are seeking.However, interestingly, this dense vector γ does provide information about the locations of nonzero elements of β 0 (see Fig. 1 and also the earlier empirical observations 32 ).To understand this important property of γ , firstly, we reformulate the matrices V and D as partitioned matrices: 1 , which is a p × p symmetric matrix.According to Eq. ( 7), each element γ i of γ is (2) Top: rank curves for the estimated guidance vector.Bottom: proportion q(r) of nonzero elements of β 0 among the r top-ranked indices i.In the two subfigures on the left panel, σ 2 = 1 and s 0 = 10, 20, 30 ; and in the two subfigures on the right panel, s 0 = 20 and σ 2 = 0, 0. Q ii ≈ n/p ≡ α (here α is the compression ratio).Expecting that v il and v jl are almost independent of each other, we get that Q ij ≈ ± √ n/p , where ± means that Q ij is positive or negative with roughly equal probability.Define ρ := s 0 /p as the sparsity of β 0 .Because β 0 is a sparse vector with only ρp nonzero entries, the summation in the right hand side of Eq. ( 8) contains at most ρp terms.Neglecting the possible weak correlations among Q ij ( j = i ), we have is the mean magnitude of the β 0 i coefficients.Putting the above approximations together, we finally get Notice that the second term in the right hand side of the Eq. ( 9) is independent of the index i.For the element γ k that has the largest magnitude among all the elements of γ , we expect that the two terms in the right hand side of Eq. ( 9) have the same sign, and thus it will have a relatively large magnitude.It then follows that the corresponding β 0 k is very likely to be nonzero and also |β 0 k | m 0 .The above analysis offers a qualitative explanation on why the guidance vector γ can help us to locate the nonzero elements in the sparse vector β 0 .When there is no noise ( ε = 0 ), this guidance vector is easy to deter- mine and it is the shortest Euclidean-length solution of an underdetermined linear equation.In the presence of measurement noise ( ε = 0 ), however, the conditional expectation E(y|X) is unknown.Then we cannot get the exact value of the guidance vector γ but can only get an approximate γ .Consider the estimator where 1 U ⊤ is the Moore-Penrose inverse of X .Notice that γ is nothing but the shortest length (i.e., minimum l 2 norm) least-square solution of linear model (Eq. 1) (hereinafter referred to as the "shortest-solu- tion").It can be proved that γ is the best linear unbiased estimator to γ (see Supplementary Section A).Combined with the above theoretical analysis, we conjecture that γ is also helpful for us to guess which elements of the true coefficient vector β 0 are nonzero.The validity of this conjecture has been confirmed by simulation results.Figure 1 shows the magnitude of elements γk of the estimated guidance vector γ in descending order (top) and the proportion q(r) of nonzero elements of β 0 among the r top-ranked indices i (bottom) for datasets generated from models with uncorrelated Gaussian measurement matrices with n = 200 , p = 1000 , s 0 = 10, 20, 30 and σ 2 = 0, 0.5, 1 .It can be seen that γ indeed contains important clues about the nonzero elements of β 0 : for the indices k that are ranked in the top in terms of magnitude of γk , the corresponding β 0 k values have high probabilities to be nonzero.In particular, for the 10 top-ranked indices in the examples of s 0 = 20 , the corresponding entries in β 0 are nearly all nonzero.
In practice, the estimated guidance vector γ can be solved through LQ decomposition or convex optimization which is more efficient than SVD.In our simulation studies, we employ the convex optimization method 32 , see Supplementary Section B for the explicit formula.
Shortest-solution guided Decimation.Based on the above theoretical analysis and empirical results, we now try to solve the linear model (Eq. 1) through a shortest-solution guided decimation algorithm.Specifically, let β = (β 1 , . . ., β p ) ⊤ be a p-dimensional coefficient vector.Assume that the k-th element of the guidance vector γ has the largest magnitude.If all the other (p−1) elements β i of the vector β are known, β k can be uniquely determined as the solution of the minimization problem Plugging Eq. ( 11) into model (Eq.1), we obtain that where 1) decimated measurement matrix with its column vector X ′ i being and y ′ is the residual of the original response vector, (8) www.nature.com/scientificreports/Notice that Eq. ( 12) has the identical form as that of the original linear model (Eq.1).Therefore, we can obtain the corresponding estimated guidance vector through the least-squares solution of Eq. (12).Then, we repeat the above decimation process (Eqs.11-14) to further shrink the residual response vector, until the certain stopping criterion is met.Suppose that a total number L of elements of β have been picked during this whole decimation process.We can uniquely and easily determine the values of these L elements by setting all the other (p−L) ele- ments to be zero and then backtracking the L constructed equations of the form Eq. ( 11).
Up to now, this above SSD algorithm is the same as the original algorithm 32 .In the original SSD algorithm, the stopping criterion is that the magnitude of the residual response vector becomes less than a prespecified threshold (e.g., 10 −5 ).We test the performance of SSD on a single noisy problem instance, to test if this stopping criterion is still appropriate for the noisy situation.Figure 2 shows the trace of the SSD process on two datasets with noise level σ 2 = 0 (left) and σ 2 = 1 (right).The two datasets have the identical 200 × 1000 measurement matrix X and the true coefficient vector β 0 with s 0 = 30 nonzero elements, which are sampled from uniform distribution U[0.5, 1] .We see that, for the noise-free situation, the decimation stops (i.e., 1 n ||y ′ || 1 < 10 −5 ) after L = 30 steps (top left) with all the nonzero elements of β 0 being recovered exactly (bottom left).However, once the noise is added, there is a significant increase in the number of decimation steps ( L = 187 , top right), and the resulting coefficient vector β is dense and is dramatically different from β 0 (bottom right).These results suggest that the stopping criterion used in the original SSD algorithm is no longer appropriate for the linear regression model with noise and needs to be improved.

Adaptive shortest-solution guided decimation (ASSD). Modified stopping criterion.
With an additional examination on the bottom right panel of Fig. 2, we find that during the early steps of the decimation process the identification of the nonzero elements of β 0 is highly accurate.Specifically, there are only four mistakes in identification in the initial 30 decimation steps.In later decimation steps, however, the index k of the largest-magnitude element γk is no longer reliable, in the sense that the true value of β 0 k may be zero.These misidentified elements are too numerous to be corrected by the subsequent backtracking process of SSD, and the resulting coefficient vector β is then quite different from β 0 .These observations indicate the necessity of stopping the decimation process earlier.
Firstly, we set an upper bound L max for the number of decimation steps.It has been established that the true coefficient vector β 0 cannot be reconstructed consistently with a sample of size n if there are more than O n/ ln(n) nonzero elements 33 .We therefore take ( 14) ).Top: trace of l 1 -norm of the residual response vector y ′ .The red stars on the horizontal axis signify that the identified element β 0 k of β 0 , with k being the index of the largest magnitude element γk of γ , is indeed nonzero.Bottom: results of coefficient estimation.The two subfigures on the left panel display the noise-free situation ( σ 2 = 0 ), and those on the right panel display the noise situation ( σ 2 = 1).
We repeat the shortest-solution guided decimation only up to L max steps.Additionally, we estimate β 0 by the solution of the l 1 minimization problem 34 Under certain conditions on the RIP (restricted isometry property) constant of X , the estimation error measured in l 2 -norm of β 0 is of the order of η/ √ n 34 .Inspired by this insight, we terminate the SSD process earlier than L max steps once the Euclidean length of the residual response vector (i.e., �y ′ � 2 ) is smaller than a prespecified value η.
Second-stage thresholding after SSD.Even after the early stopping strategy is applied to the decimation process, we find that some of the zero-valued coefficients β 0 i are still predicted to be nonzero by the algorithm.To reduce this false-positive fraction as much as possible, we propose a second-stage thresholding procedure to the SSD algorithm.The idea is to manually reset some of the coefficients β i = 0 if the value predicted by the SSD algo- rithm is below a certain threshold value.This refinement procedure turns out to be rather effective in improving the variable selection accuracy.
Suppose that after early stopping L of β are assigned with nonzero values, and the indices of all the zero-valued coefficients form a set A (i.e., β i = 0 if and only if i ∈ A ).We sort the absolute values of these L esti- mated coefficients in an ascending order (say |β L/2 j=1 β r j .Notice σ is distinct in meaning from the the noise magnitude σ of the original model system (Eq. 1).We adopt a data-driven procedure to determine the optimal thresholding level.First we set to be the basic thresholding level 35 (see also the initial work on thresholding to wavelet coefficients 36 ).Next, we take the actual thresholding level θ to be θ = τ θ 0 with τ taking discrete values.As τ increases from zero to a relatively large value R (e.g., R = 20 ), the threshold value θ becomes more and more elevated.At a given value of τ , we first update the index set A by adding some indices i to A if |β i | < τ θ 0 , and then we update the remaining elements of β by solving the minimization problem Finally, we compute the BIC (Bayesian Information Criterion) index 33 as where p nz means the number of nonzero elements in the vector β , namely p nz = p − |A| .The BIC value is a trade-off between the prediction error and the model complexity.We choose the value of τ such that BIC achieves the minimum value, and consider the corresponding coefficient vector β as the final solution of the linear regres- sion problem (Eq.1).
An initial demonstration of ASSD performance.We summarize the above ideas in the pseudo-code of Algorithm 1.This ASSD algorithm has two parts: decimation with early stopping, followed by refinement by secondstage thresholding.
Let us work on a small example case to better appreciate the working characteristics of ASSD.We generate an n × p random Gaussian matrix X with n = 200 and p = 1000 , whose elements are i.i.d.N (0, 1) distributed.The truth coefficient vector β 0 has s 0 = 30 nonzero elements, each of which is sampled from uniform distributions U[−1, −0.5] and U[0.5, 1] with equal probability, and 970 zero elements.The response vector y is generated from the linear regression model (Eq. 1) with error level σ 2 = 1 .We compare the performance of ASSD with that of the original SSD which does not conduct early-stopping nor the second-stage thresholding, and that of SSD1, which only adopts early-stopping but skips the second-stage thresholding.
The algorithmic results shown in Fig. 3 reveal that all these three algorithms assigned good approximate values for the nonzero elements of β 0 .SSD has a high false-positive rate (154 of the zero elements of β 0 are misclassified as nonzero), and early-stopping dramatically reduces this rate (only 8 false-positive predictions in SSD1).By applying the second-stage thresholding, ASSD achieves a zero false-positive rate.In addition, ASSD and SSD1 are more efficient than SSD (SSD, 27.2 s; SSD1, 7.83 s; ASSD, 8.2 s).Overall, the presence of the measurement noise usually renders SSD to produce a solution with a high false-positive rate, and two modifications of ASSD, i.e., a modified early-stopping criterion, coupled with a second-stage thresholding, are proposed to reduce the false-positive rate as much as possible.

Results
Model implementation.To better gauge the performance of ASSD, we compare ASSD with five different methods: LASSO, Adaptive LASSO (ALASSO) , VAMP, SIS+LASSO, and ASDAR.We implemented all these methods in Matlab.Our implementation of LASSO uses the function lasso.For ALASSO, we use the LASSO solution βLASSO as the initial estimator, and set the weight as ω j = 1/| βLASSO j | , j = 1, . . ., p .For VAMP, we use the publicly available Matlab package 37 .The algorithm SIS+LASSO first selects (n ln n) variables based on SIS and then runs LASSO to further reduce the number of falsely identified nonzero coefficients.We implement ASDAR by using the Matlab package sdar 31 .
For ASSD, we set R = 20 , L max = n/ ln n (if not specified) and η = √ nσ (in practical applications, if σ (the s.d. of noise) is unknown, we can set η to be a small value, e.g.0.1 as used in Fig. 3).For LASSO, ALASSO, and SIS+LASSO, the tuning parameters are selected by using 10-fold cross validation.For ASDAR, we set τ = 5 and stop the iteration if the number of identified nonzero elements is greater than L = 0.5n , or the residual norm is smaller than √ nσ , or the distance of two subsequent solutions (measured in l 2 -norm) is smaller than 1.For VAMP, a small amount of damping is useful when the measurement matrix is ill-conditioned.We set the dampling parameter to be 0.95.Other parameters of VAMP, including the maximum number of iterations, the tolerance for stopping, are the default values in public-domain GAMPmatlab toolbox 37 .
We focus on four metrics for algorithmic comparisons: (1) the relative error (RE) of estimation, defined as β − β 0 2 / β 0 2 ; (2) the true positive counts (TP) and (3) the false positive counts (FP) of variable selection, and (4) the CPU time in seconds.In each scenario, we calculate the average and standard deviation of these four metrics over 96 independent runs.• Correlated Gaussian matrix: Each row of the matrix X is drawn independently from N (0, ) , where � ij = π |j−i| , 1 ≤ i, j ≤ p with π = 0 and 0.7 corresponding to no and strong correlations.• Structured matrix: The matrix X is the product of an n × r matrix X 1 and an r × p matrix X 2 .Both X 1 and X 2 are random Gaussian matrices whose elements are independently generated from N (0, 1) .The rank r is closely related to the degree of correlation between elements in matrix X .When r ≫ n , the elements in matrix X are weakly correlated or even uncorrelated.As r approaches n from above, the elements in matrix X are more and more correlated.We consider two scenarios: r = n + 2000 = 2300 and r = n + 5 = 305 corresponding to weakly correlated and highly correlated (or structured) matrices, respectively.• Real-world matrix: We choose the gene expression data from The Cancer Genome Atlas (TCGA) ovarian cancer samples 38 and we use the dataset provided by two earlier studies 39,40 .The dataset is available athttps:// bioin forma tics.mdand erson.org/ Suppl ements/ Resid ualDi sease/.There are 594 and 22, 277 genes in the original dataset.We randomly subsample the samples and genes to obtain a 300 × 2000 measurement matrix X.
The response vector y is generated via the linear regression model Eq. ( 1), in which the random errors are inde- pendently generated from normal distribution with means 0 and variance σ 2 = 1.

Correlated Gaussian matrix.
Table 1 shows the results on Gaussian matrices.Here and hereinafter, the standard deviations of metrics are shown in the parentheses, and in each column, the numbers in boldface indicate the best performers.It is observed that ASSD has the best performance in variable selection.ASDAR achieves similar performance with ASSD when there is no correlation ( π = 0 ), but it suffers from identifying more false positives when π increases to π = 0.7 .For estimation, ASSD again has the best or close-to-the-best performance compared with VAMP.Although VAMP produces a smaller relative error than ASSD when π = 0 , its perfor- mance deteriorates significantly when the correlation π is high.ASSD shows no advantage in speed.ASSD is similar to LASSO and ALASSO in computation time, but it is much slower than ASDAR and VAMP.
Structured matrices.Results on the structured matrices are reported in Table 2.We see that when the rank number r of the matrix X is large, i.e. r = 2300 , VAMP, ASDAR, and ASSD are on the top of the list in metrics for variable selection (TP and FP) and the metric for estimation (RE).As the rank number r approaches n ( r = 305 ), ASDAR becomes less accurate in variable selection and coefficient estimation than VAMP and ASSD.The favorable performance of VAMP is not unexpected because it can achieve Bayes-optimal estimation for a class of structured measurement matrix, namely that of the rotationally-invariant matrix.It is encouraging to The measurement is 200 × 1000 , while the noise level is σ 2 = 1 .The blue crosses mark the 30 nonzero elements of the true coefficient vector β 0 , the red circles mark the elements of the estimated coefficient vector β .In ASSD, we set R = 20 and η = 0.1.
Vol www.nature.com/scientificreports/observe that ASSD performs comparably well in variable selection and estimation, even when the measurement matrix is highly structured.
Real-world matrix.Table 3 shows the results on a real measurement matrix which is a subsample of gene expression data from TCGA ovarian cancer samples.ASSD again has the best performance both in variable selection and in coefficient estimation.LASSO, ALASSO, SIS+LASSO, and VAMP do not work well on the real matrix as they identify too many false positives and produce significantly larger estimation errors.ASDAR is similar to ASSD in terms of estimation error, but it is inferior to ASSD in terms of variable selection.It is indeed quite a remarkable observation that only the ASSD algorithm achieves almost perfect accuracy for this realworld problem instance.We conduct additional simulation to examine dependence of the proposed ASSD on the distribution of coefficients.First, consider strict sparsity case.The nonzero coefficients are sampled from the uniform distribution    S3 reports the results on the real-world matrix.We also consider a larger noise with σ 2 = 1.5 .The nonzero coefficients are sampled from uniform distribution U[0.5, 1] .The simulation results on the real-world matrix are displayed in Supplementary Table S4.Under the above four scenarios, ASSD generally have favorable performance in variable selection and coefficient estimation, in particular, it has the lowest false positives.It is noted that, compared to LASSO and ALASSO, ASSD identifies less true nonzero coefficients.A possible reason is that the adopted stopping criterion is a little bit too strict (i.e., L is too small), and the values of TP are expected to increase with looser stopping criterion.These results demonstrate that ASSD is especially well-suited to scenarios which put a higher value on precision than recall of variable selection.
Influence of model parameters.We now investigate more closely the effect of each of the model parameters (the sample size n, the number of predictors p, and the sparsity level s 0 ) on the performance of LASSO, VAMP, SIS+LASSO, ASDAR, and ASSD.From the above simulation results, we observe that ALASSO performs better than LASSO in variable selection; however, the improvements over LASSO is not large, but with greater computational cost.As such, we do not report the results of ALASSO in this section.The same three types of measurement matrices are examined: Gaussian matrix with π = 0.7 , structured matrix with the rank number r = n + 5 , and the real-world matrix.The nonzero elements of β 0 are i.i.d.random values drawn from the uniform distribution over [0.5, 1].We generate the response vector from the linear regression model (1).The random errors are generated independently from N (0, 0.5) .The simulation results are based on 96 independent repeats.Figure 4 shows the influence of sample size n on the relative errors (top left panel), true positives (top right panel), false positives (bottom left panel), and probability of exact identification of nonzero coefficients (bottom right panel) when the measurement matrix is a correlated Gaussian one.Results obtained on the other two types of measurement matrices can be found as Supplementary Fig. S1 and Supplementary Fig. S2 .(For the real-world matrix and the structured measurement matrices, the results of VAMP are too unstable to be shown here and hereinafter.)As expected, the performances of all the methods improve as n increases.For the real-world and the correlated Gaussian matrices, ASDAR and ASSD perform comparably well in estimation accuracy.However, ASSD performs significantly better than ASDAR in the accuracy of variable selection.Specifically, ASSD can exactly recover the support when n = 300 , whereas the success probability of ASDAR is only 42% .Similar observations are obtained for the real-world measurement matrix.For the structured measurement matrices, VAMP has the best performance, and ASSD again has close-to-the-best performance compared with ASDAR, LASSO, and SIS+LASSO.
Figure 5 shows the influence of the number of covariates p on the performances of the four methods for a correlated Gaussian measurement matrix.Data are generated from the model with s 0 = 30 and n = 300 .We see www.nature.com/scientificreports/ that ASSD always produces the lowest relative errors and FP, and the highest TP.In particular, the probability of exactly recovering the support of the true coefficient vector β 0 of ASSD is higher than that of the other methods as p increases, which indicates that ASSD is more robust to the number of covariates.Similar observations are also made for the other types of measurement matrices as Supplementary Fig. S3 and Supplementary Fig. S4 .The influence of the sparsity level s 0 on the performance of the four methods for a correlated Gaussian meas- urement matrix is presented in Fig. 6.The corresponding results obtained for the other types of matrices are presented as Supplementary Fig. S5 and Supplementary Fig. S6 .Data are generated with n = 300 and p = 2000 .We use L = 0.5n for ASSD and ASDAR as well since the maximum s 0 = 100 .When the number of nonzero elements s 0 increases, the performances of all the four methods become worse.When s 0 is small (e.g.s 0 ≤ 40 ), ASSD generally has the best performance in the accuracy of estimation and variable selection (i.e., has highest TP and probability of exact identification of nonzero coefficients).However, when s 0 is large, ASSD performs worse than some comparison methods.For example, when s 0 ≤ 70 , ASSD produces higher RE and lower TP than LASSO.These results indicate that ASSD is well-suited to "strong sparsity" scenario where the number of nonzero coefficients is small.In summary, our simulation results demonstrate that the proposed ASSD is more accurate and robust in variable selection and coefficient estimation than LASSO, VAMP, SIS+LASSO, and ASDAR.This ASSD algorithm is a promising heuristic method for highly correlated random and real-world measurement matrices.

Discussion
In this paper, we proposed the adaptive shortest-solution guided decimation (ASSD) algorithm to estimate highdimensional sparse linear regression models.Compared to the original SSD algorithm which is developed for linear regression models without noise 32 , the ASSD algorithm takes into account the effect of measurement noise and adopts an early-stopping strategy and a second-stage thresholding procedure, resulting in significantly better performance in variables selection (which columns X i are relevant) and coefficients estimation (what are the corresponding regression values β i ).Extensive simulation studies demonstrate that ASSD has favorable perfor- mance, and outperforms the comparison methods in variable selection, and is competitive with or outperforms VAMP and ASDAR in coefficient estimation.It is robust to the model parameters, and it is especially robust for different types of measurement matrices such as those whose entries are highly correlated.These numerical results suggest that ASSD can serve as an efficient and robust tool for real-world sparse estimation problems.
In terms of speed, ASSD is slower than VAMP and ASDAR and this is an issue to be further improved in the future.To accelerate ASSD, on the one hand, we can select a small fraction of elements in coefficient vector instead just one of them in each decimation step, and on the other hand, we can adopt a more delicate early-stopping strategy to further reduce the unnecessary decimation steps.In addition, the rigorous theoretical understanding of ASSD needs to be pursued.We have only considered the linear regression model in this paper.It will be interesting to generalize ASSD to other types of models, such as the logistic model and cox model.

Figure 2 .
Figure 2.Simulation results of SSD with the naive stopping criterion 10 −5 on a single uncorrelated Gaussian measurement matrix ( p = 1000 , n = 200 ).Top: trace of l 1 -norm of the residual response vector y ′ .The red stars on the horizontal axis signify that the identified element β 0 k of β 0 , with k being the index of the largest magnitude element γk of γ , is indeed nonzero.Bottom: results of coefficient estimation.The two subfigures on the left panel display the noise-free situation ( σ 2 = 0 ), and those on the right panel display the noise situation ( σ 2 = 1).

2 +
p nz ln n, Vol:.(1234567890)Scientific Reports | (2021) 11:24034 | https://doi.org/10.1038/s41598-021-03323-7www.nature.com/scientificreports/Results on three types of measurement matrices.We first consider different types of measurement matrices X of the same size.In our numerical experiments we set n = 300 and p = 2000 .The number of nonzero coefficients in β 0 is set to be s 0 = 40 , with each of them being generated from the uniform distribution U[0.5, 1] .Three types of measurement matrix are considered:

Figure 3 .
Figure3.Performance comparison on the original SSD (left), SSD1 which is SSD with early-stopping, and ASSD.The measurement is 200 × 1000 , while the noise level is σ 2 = 1 .The blue crosses mark the 30 nonzero elements of the true coefficient vector β 0 , the red circles mark the elements of the estimated coefficient vector β .In ASSD, we set R = 20 and η = 0.1.

Figure 4 .
Figure 4. Simulation results on a correlated Gaussian measurement matrix ( π = 0.7 , p = 2000 , s 0 = 30 , σ 2 = 0.5 ): influence of the sample size n on relative errors (top left), true positives (top right), false positives (bottom left) with the inset being a semi-logarithmic plot, and probability of exact identification of nonzero coefficients (bottom right).

Figure 5 .
Figure 5. Simulation results on a correlated Gaussian measurement matrix ( π = 0.7 , n = 300 , s 0 = 30 , σ 2 = 0.5 ): influence of the covariates number p on relative errors (top left), true positives (top right), false positives (bottom left) with the inset being a semi-logarithmic plot, and probability of exact identification of nonzero coefficients (bottom right).

Figure 6 .
Figure 6.Simulation results on a correlated Gaussian measurement matrix ( π = 0.7 , n = 300 , p = 2000 , σ 2 = 0.5 ): influence of the sparsity level s 0 on relative errors (top left), true positives (top right), false positives (bottom left) with the inset being a semi-logarithmic plot, and probability of exact identification of nonzero coefficients (bottom right).

Table 3 .
Simulation results on a real-world measurement matrix with p = 2000 , n = 300 , σ 2 = 1 and s 0 = 40.[0.5, 1] and U[−1, −0.5] with equal probability; and (b) sampled from uniform distribution U[0.2, 1] .The other settings are the same as above.We use the real-world matrix as a representative and present the results in Supplementary TableS1-S2.Second, consider weak sparsity case, where the s 0 = 40 coefficients are sampled from uniform distribution U[0.5, 1] and other coefficients are set to be 0.001.Supplementary Table U