## Introduction

Reconstruction of the seismic wavefield has recently attracted increasing attention in the geophysical community. Seismic acquisition often violates the Shannon sampling theorem because of restrictions imposed by investment, topography, noise, bad traces and so on. The under-sampled data introduce aliasing and artifacts that degrade the results of migration1, de-noising2, multiple elimination3 and AVO analysis4. In addition, storing the massive data volumes is itself a problem, so lossless compression methods are desirable5. An important branch of these methods is the sparse-transform-based method combined with a regularization strategy6,7. In this method, seismic interpolation is treated as an inverse problem, and seismic events are assumed to be sparse in some transformed domain, such as the Fourier transform1,8,9,10,11 or the linear Radon transform12. The acquired geophysical data are usually subsampled because of variations of landform1,13,14, so seismic wavefield recovery is an ill-posed inverse problem. A key issue is therefore how to invert the mathematical model using only incomplete, sub-sampled data1,13,14. A variety of regularization methods have been developed to improve the quality of image and seismic wavefield recovery6,15,16,17.

Previous methods for such a recovery problem are based on $$l_{q}$$-norm minimization, e.g., the basis pursuit denoising (BPDN) criterion using the (orthogonal) matching pursuit method18,19 and the least absolute shrinkage and selection operator (LASSO)20 for $$l_{1}$$-norm constrained minimization problems. Efficient optimization algorithms include conjugate gradient methods with preconditioning techniques21 and gradient projection methods22,23,24,25,26. To solve the $$l_{q}$$-norm minimization problem, one usually converts the matrix form of the wavefield into vector form and solves the corresponding matrix-vector equations. We observe that the seismic wavefield can be represented using a matrix instead of a vector as the weight variable, so that all the signals are expressed simultaneously and the interrelation of the sampled observations is taken into account. This is more reasonable because seismic signals are correlated transversely. Therefore, in this paper we propose a matrix optimization model for seismic wavefield recovery and study its properties. Mixed matrix minimization models have been used in machine learning. Rakotomamonjy et al.27 proposed the mixed matrix norm $$l_{q,p}$$ (1 ≤ q < 2, 0 < p ≤ 1) for multi-kernel and multi-task learning, but the induced optimization problems in27 have to be solved separately by different algorithms for p = 1 and 0 < p < 1. For grouped feature selection, Suvrit28 addressed a fast projection technique onto $$l_{1,p}$$-norm balls, particularly for p = 2, ∞, but the method derived in28 does not match the proposed matrix optimization problem (11). A similar joint sparse representation has been used for robust multi-modal biometrics recognition in29, where Sumit et al.29 employed the traditional alternating direction method of multipliers to solve the optimization problem involved. Wang et al.30 applied the $$l_{2,{0}^{+}}$$-norm to semi-supervised robust dictionary learning, but no definite convergence analysis of the optimization algorithm was given30.

Recently, matrix minimization methods with the nuclear norm have been developed for seismic wavefield recovery31,32,33,34; these mainly treat rank reduction as the sparse pattern in 2D cases. To avoid the expensive computations involved in solving the matrix completion optimization problems, a matrix factorization strategy was developed in31,32. This paper proposes a different matrix minimization approach based on the $$l_{2,q}$$-$$l_{2,p}$$ norm, which naturally generalizes the representative vector to a matrix in the joint distribution sense. A unified method is developed to solve the matrix optimization problem with mixed norm for q = 2 and any 0 < p ≤ 1. The innovations of this paper can be listed as follows:

1. A jointly sparse matrix minimization model is developed for seismic wavefield recovery. This approach employs a matrix to express multiple signals simultaneously. The measure on the rows of the coefficient matrix is expected to capture the compact prior shared by multiple observations, which differs from the existing methods based on matrix nuclear-norm minimization31,32,33,34.

2. A unified algorithm is developed to solve the mixed matrix optimization problem (11) for any $$p\in (0,1]$$. The algorithm needs only matrix-vector operations rather than matrix factorization, so it can be easily adapted to large-scale cases. A convergence analysis is also given.

3. Numerical experiments on synthetic and field data are carried out. The results on seismic wavefield recovery demonstrate the efficient recovery performance of the joint sparse expression strategy.

## Modeling

Given a set of seismic signals (traces) $${x}_{1},{x}_{2},\cdots ,{x}_{l}$$ in n-dimensional space, each signal $${x}_{j}$$ (j = 1, 2, …, l) is sensed by m sensors to yield seismic wavefield records as

$${d}_{ij}={A}^{i}{x}_{j},i=1,2,\cdots ,m,$$
(1)

where $${A}^{i}$$ is a row vector representing the impulse response of the i-th sensor. Denote $$A={[{({A}^{1})}^{T},{({A}^{2})}^{T},\cdots ,{({A}^{m})}^{T}]}^{T}$$; then the seismic observations $${d}_{j}={[{d}_{1j},{d}_{2j},\cdots ,{d}_{mj}]}^{T}\in {R}^{m}$$ can be written as $${d}_{j}=A{x}_{j}$$ (j = 1, 2, …, l). Sparse representation is a popular strategy for restoring $${x}_{j}$$ when the number of measurements m of the mapping operator A is much smaller than n.

Suppose that the original seismic signal x j can be spanned by a series of orthogonal bases $${\{{{\rm{\Psi }}}_{k}\}}_{k=1}^{K}$$ such that

$${x}_{j}(t)=\sum _{k=1}^{K}{m}_{j}^{k}{{\rm{\Psi }}}_{k},$$
(2)

where $${m}_{j}^{k}=({x}_{j},{{\rm{\Psi }}}_{k})$$. Denote by Ψ the orthogonal matrix formed by the orthogonal bases; then we have the more compact transformation $$L=A{\rm{\Psi }}\in {R}^{m\times K}$$. Consequently, the systems (1) and (2) can be combined into

$$L{m}_{j}={d}_{j},\,\,\,\,j=1,2,\cdots ,l,$$
(3)

where $${m}_{j}={\rm{\Psi }}\ast {x}_{j}={\{{m}_{j}^{k}\}}_{k=1}^{K}$$ is the coefficient vector (weighting factor) corresponding to the seismic signal $${x}_{j}$$. Usually, problem (3) is ill-posed owing to acquisition limitations and violation of the sampling requirements. Sparse regularization is preferred for restoring the representation coefficients from the under-determined linear system (3). A general $$l_{q}$$-$$l_{p}$$ (q > 0, p > 0) model was presented in16

$$\mathop{\min }\limits_{{m}_{j}}\,{J}_{j}^{(\alpha )}({m}_{j})=\parallel L{m}_{j}-{d}_{j}{\parallel }_{q}^{q}+{\alpha }_{j}\parallel {m}_{j}{\parallel }_{p}^{p},\,\,\,\,q > 0,p > 0,$$
(4)

where $$\parallel {m}_{j}{\parallel }_{p}^{p}=\sum _{k=1}^{K}|{m}_{j}^{k}{|}^{p}$$ is the stabilizer bearing prior information with respect to $${d}_{j}$$ and $${\alpha }_{j} > 0$$ is a regularization parameter. When 0 < p ≤ 1, the minimization model (4) seeks a sparse recovery coefficient $${m}_{j}$$ with the fewest nonzero entries. However, the framework (4) recovers the weight factor $${m}_{j}$$ using only the j-th seismic trace record $${d}_{j}$$, and therefore ignores the correlation with the other sampled data $${d}_{\hat{j}}$$ ($$\hat{j}\ne j$$). In general, multiple seismic wavefield traces are related to each other, and the similarities and differences hidden in a given group of seismic traces can be expected to improve the recovery performance. To demonstrate the correlation among multiple seismic traces in detail, we randomly choose three trace observations from a seismogram generated from a seven-layer geologic velocity model (see the Experimental results section for details). Two neighboring traces are denoted by $${d}_{1}$$ and $${d}_{2}$$, while the third one, $${d}_{3}$$, is relatively far from them. We separately recover the representation coefficients $${m}_{j}^{\ast },j=1,2,3$$ by solving

$${m}_{j}^{\ast }=\text{arg}\mathop{\min }\limits_{{m}_{j}\in {R}^{K}}J({m}_{j})=\parallel L{m}_{j}-{d}_{j}{\parallel }_{2}^{2}+\alpha \sum _{k=1}^{K}|{m}_{j}^{k}{|}^{p},\quad j=1,2,3,\,\,p=0.5,$$
(5)

where $${m}_{j}^{k}$$ is the k-th entry of $${m}_{j}$$. The weight values of the recovered coefficients are plotted in Fig. 1(a–c). The horizontal axis denotes the coordinates of the representation vector, while the vertical axis shows the magnitudes of the representation coefficients, namely $$|{({m}_{j}^{\ast })}^{k}|,k=1,2,\cdots ,256;j=1,2,3$$. The curves clearly display similar clustering and sparsity patterns for the three recovered coefficient vectors. These correlations motivate the assumption that the coefficients of multiple traces share the same distribution. For comparison, we recover the three coefficient vectors simultaneously from $${D}_{1,2,3}=[{d}_{1},{d}_{2},{d}_{3}]\in {R}^{m\times 3}$$ by solving the matrix minimization problem

$${M}_{1,2,3}^{\ast }=\text{arg}\mathop{\min }\limits_{{M}_{1,2,3}\in {R}^{K\times 3}}J({M}_{1,2,3})=\parallel L{M}_{1,2,3}-{D}_{1,2,3}{\parallel }_{F}^{2}+\alpha \sum _{k=1}^{K}\parallel {m}_{1,2,3}^{k}{\parallel }_{2}^{p},\,\,p=0.5,$$
(6)

where $${M}_{1,2,3}^{\ast }\in {R}^{K\times 3}$$ and $${m}_{1,2,3}^{k}\in {R}^{3}$$ is the k-th row of $${M}_{1,2,3}$$. Since the three vector minimizations (5) are integrated into the single matrix minimization (6), each entry $${m}_{j}^{k}$$ of a representation vector is extended to a row vector $${m}_{1,2,3}^{k}\in {R}^{3}$$. Hence the absolute values of the weight entries in (5) are naturally generalized to the $$l_{2}$$ norm of the row vector, chosen for its smoothness, that is, $$|{m}_{j}^{k}|\to \parallel {m}_{1,2,3}^{k}{\parallel }_{2}$$. To show that the jointly recovered coefficient matrix $${M}_{1,2,3}^{\ast }$$ of (6) follows a variation similar to that in Fig. 1(a–c), we measure the $$l_{2}$$ norm of each row vector, which corresponds in the joint sense to $$|{({m}_{j}^{\ast })}^{k}|$$,

$$|{({m}_{j}^{\ast })}^{k}|\to {\parallel {({M}_{1,2,3}^{\ast })}^{k}\parallel }_{2}={({|{({m}_{1,2,3}^{\ast })}_{1}^{k}|}^{2}+{|{({m}_{1,2,3}^{\ast })}_{2}^{k}|}^{2}+{|{({m}_{1,2,3}^{\ast })}_{3}^{k}|}^{2})}^{\frac{1}{2}},k=1,2,\cdots ,256.$$
(7)

Clearly, the joint representation coefficients exhibit a sparse pattern and weight concentration similar to those of the individual models (see Fig. 1(d)).

Under the assumption that multiple seismic wavefield traces share a similar weight pattern, we propose to express all the sub-sampled observations over the same bases simultaneously as

$$LM=D,$$
(8)

where $$D=[{d}_{1},{d}_{2},\cdots ,{d}_{l}]$$ is composed of the l seismic observations and $$M=[{m}_{1},{m}_{2},\cdots ,{m}_{l}]$$ denotes the corresponding coefficient matrix. Column by column, equation (8) is a direct consequence of equation (3). Figure 1 has demonstrated that multiple seismic traces are related to each other, especially when the samples are obtained in similar fields. It is therefore reasonable to measure the joint compactness and correlation of the multiple observations row-wise. Reviewing the $$l_{q}$$-$$l_{p}$$ (q > 0, p > 0) model (4), we notice that the expression errors $${e}_{j}=L{m}_{j}-{d}_{j}$$, j = 1, 2, …, l, and the prior of the representation coefficients are assumed to be independent and identically distributed,

$$P({e}_{j}|{m}_{j})\propto \underset{k=1}{\overset{K}{{\rm{\Pi }}}}\exp (-{|{e}_{j}^{k}|}^{q}),P({m}_{j})\propto \underset{k=1}{\overset{K}{{\rm{\Pi }}}}\exp (-{\alpha }_{k}{|{m}_{j}^{k}|}^{p}),{\alpha }_{k} > 0$$
(9)

where $${m}_{j}^{k}$$ is the k-th entry of the representation vector $${m}_{j}\in {R}^{K}$$. The solution $${m}_{j}^{\ast }$$ to (4) can then be rewritten as the maximum a posteriori (MAP) estimate

$${m}_{j}^{\ast }=\text{arg}\mathop{\max }\limits_{{m}_{j}\in {R}^{K}}\,\mathrm{ln}\,P({m}_{j}|{e}_{j})=\text{arg}\mathop{\max }\limits_{{m}_{j}\in {R}^{K}}(\mathrm{ln}\,P({e}_{j}|{m}_{j})+\,\mathrm{ln}\,P({m}_{j}))$$
(10)

Because each coefficient component $${m}_{j}^{k}$$ in (3) is extended to a row vector in the joint expression system (8), the absolute value of the scalar component is naturally replaced by a vector norm. The Euclidean norm is preferred for its smoothness and simplicity. Based on the analysis in (9) and (10), the joint sparse prior of the coefficient matrix M and the fidelity error matrix $$E=LM-D$$ can be modeled as

$$P(E|M)\propto \underset{k=1}{\overset{K}{{\rm{\Pi }}}}\exp (-{\Vert {e}^{k}\Vert }_{2}^{q}),P(M)\propto \underset{k=1}{\overset{K}{{\rm{\Pi }}}}\exp (-{\alpha }_{k}{\Vert {m}^{k}\Vert }_{2}^{p}),{\alpha }_{k} > 0$$

where $${m}^{k}$$ and $${e}^{k}$$ are the k-th row vectors of $$M\in {R}^{K\times l}$$ and $$E\in {R}^{m\times l}$$ respectively, $${\alpha }_{k} > 0$$ is a constant and $${\Vert .\Vert }_{2}$$ stands for the Euclidean norm. By the same relationship as that between (4) and (9), the joint matrix minimization approach for the ill-posed linear system (8) can be generally formulated as

$$\mathop{\min }\limits_{M}\,J(M)=\parallel LM-D{\parallel }_{2,q}^{q}+\parallel {\rm{\Lambda }}M{\parallel }_{2,p}^{p},\,\,\,\,\,\,\,\,q > 0,p > 0,$$
(11)

where the l2, p norm of the priori matrix M is defined as

$$\parallel M{\parallel }_{2,p}^{p}=\sum _{k=1}^{K}\parallel {m}^{k}{\parallel }_{2}^{p},\,\,\,\,p\in (0,1].$$
(12)

Here $$\parallel LM-D{\parallel }_{2,q}(q > 0)$$ denotes the $$l_{2,q}$$ matrix norm of $$LM-D$$, $${\rm{\Lambda }}=diag{\{{\alpha }_{k}\}}_{k=1}^{K}$$ is a regularization matrix and its diagonal entry $${\alpha }_{k} > 0$$ is the regularization parameter for the k-th row of M. In particular, if M contains only one column $${m}_{j}$$, each $$\parallel {m}^{k}{\parallel }_{2}$$ reduces to $$|{m}_{j}^{k}|$$ and $$\parallel M{\parallel }_{2,p}$$ coincides with $$\parallel {m}_{j}{\parallel }_{p}$$. When Λ is a scalar multiple of the identity, the joint system (11) reduces exactly to (4).
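The definition (12) and its single-column reduction can be checked numerically. The following is a minimal NumPy sketch; the helper name `l2p_norm_p` and the random test vector are illustrative assumptions, not part of the original work.

```python
import numpy as np

def l2p_norm_p(M, p):
    """Return ||M||_{2,p}^p = sum_k ||m^k||_2^p, where m^k is the k-th row of M (definition (12))."""
    row_norms = np.linalg.norm(np.asarray(M, dtype=float), axis=1)
    return float(np.sum(row_norms ** p))

rng = np.random.default_rng(0)
p = 0.5
m_j = rng.standard_normal(8)                      # a single coefficient vector

# With one column, each row norm is |m_j^k|, so the l_{2,p} value equals ||m_j||_p^p.
single_column_value = l2p_norm_p(m_j[:, None], p)
vector_value = float(np.sum(np.abs(m_j) ** p))
```

Because the penalty acts on whole rows, a row contributes nothing only when its entries vanish for every trace, which is what makes the prior jointly sparse.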

There are different choices of the parameter pair q > 0 and p > 0. Here we are interested in q = 2 and $$p\in (0,1]$$ for practical purposes. Extensive studies have shown that the fractional norm $$l_{p}$$ ($$p\in (0,1)$$) promotes sparsity better than the $$l_{1}$$ norm35,36,37,38,39. However, the $$l_{p}$$ norm is neither Lipschitz continuous nor convex, which poses a computational challenge. This paper presents a unified algorithm to solve the mixed $$l_{2,p}$$ regularized matrix minimization problem (11) for any $$p\in (0,1]$$. The computational results in seismic wavefield recovery validate the efficient performance of the joint matrix minimization approach. The convergence properties of the new algorithm are also analyzed.

## Algorithms

In this section, a unified method is developed to solve the $$l_{2,q}$$-$$l_{2,p}$$ matrix minimization problem for q = 2 and any 0 < p ≤ 1. In particular, when p is fractional, (11) is neither convex nor Lipschitz continuous, which brings many computational difficulties. Indeed, as shown in40, the unconstrained $$l_{q}$$-$$l_{p}$$ minimization is strongly NP-hard for any 0 < q or p < 1. The reweighted minimization algorithm35,41,42 is an efficient algorithm for solving the $$l_{2}$$-$$l_{p}$$ (0 < p < 1) vector minimization problem, and it has been extended by Wang et al.43 to matrix minimization problems. Although the problem considered in43 is only the special case of (11) with $$q=p\in (0,1]$$, the idea motivates us to develop an iterative quadratic algorithm for the generalized $$l_{2,p}$$ matrix minimization with $$p\in (0,1]$$. Moreover, the convergence analysis is given in a unified manner.

After simple transformation, $$\parallel {\rm{\Lambda }}M{\parallel }_{2,p}^{p}$$ can be rewritten as

$$\parallel {\rm{\Lambda }}M{\parallel }_{2,p}^{p}=Tr({M}^{T}HM),$$
(13)

where $$Tr(\cdot )$$ stands for the trace operation and

$$H={\rm{diag}}\{\frac{{\alpha }_{1}}{\parallel {m}^{1}{\parallel }_{2}^{2-p}},\frac{{\alpha }_{2}}{\parallel {m}^{2}{\parallel }_{2}^{2-p}},\cdots ,\frac{{\alpha }_{K}}{\parallel {m}^{K}{\parallel }_{2}^{2-p}}\},$$
(14)

where $${m}^{k}$$ (k = 1, 2, …, K) is the k-th row vector of M.
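As a quick sanity check of (13) and (14), the trace form can be compared with the row-wise sum it represents. This is only an illustrative sketch with random data; it assumes, as (14) does, that the weights $${\alpha }_{k}$$ multiply $$\parallel {m}^{k}{\parallel }_{2}^{p}$$ linearly and that no row of M is zero.

```python
import numpy as np

rng = np.random.default_rng(1)
K, l, p = 6, 3, 0.5
M = rng.standard_normal((K, l))          # coefficient matrix with nonzero rows
alpha = rng.uniform(0.1, 1.0, size=K)    # regularization weights alpha_k > 0

row_norms = np.linalg.norm(M, axis=1)
H = np.diag(alpha / row_norms ** (2.0 - p))      # equation (14)

trace_form = np.trace(M.T @ H @ M)               # right-hand side of (13)
row_sum_form = np.sum(alpha * row_norms ** p)    # sum_k alpha_k ||m^k||_2^p
```

The two values agree because $$Tr({M}^{T}HM)=\sum _{k}{H}_{kk}\parallel {m}^{k}{\parallel }_{2}^{2}$$ and $${H}_{kk}\parallel {m}^{k}{\parallel }_{2}^{2}={\alpha }_{k}\parallel {m}^{k}{\parallel }_{2}^{p}$$.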

Hence the objective function of (11) for q = 2, p$$\in$$ (0, 1] can be reformulated as

$$\begin{array}{c}J(M):={\Vert LM-D\Vert }_{2,2}^{2}+{\Vert {\rm{\Lambda }}M\Vert }_{2,p}^{p}\\ \,\,\,\,\,\,=Tr({(LM-D)}^{T}(LM-D))+Tr({M}^{T}HM)\end{array}$$
(15)

It is well known that a KKT point of the unconstrained optimization problem (11) is also a stationary point of J(M)44. Computing the derivative of J(M) with respect to the matrix M and setting it to zero, we obtain the KKT equation of problem (11) as follows

$$\frac{\partial J(M)}{\partial M}=2{L}^{T}(LM-D)+pHM=0.$$
(16)

Thus solving (11) is reduced to finding the solution of the nonlinear equation (16). If H is fixed and the matrix $$N={L}^{T}L+\frac{p}{2}H$$ is invertible, equation (16) can be solved by

$$M={({L}^{T}L+\frac{p}{2}H)}^{-1}{L}^{T}D.$$
(17)

We notice that if some row of M is zero, the corresponding diagonal entry of H is undefined, and so is N; the iteration then breaks down. In terms of seismic wavefield recovery, a zero row means that the corresponding basis function makes no contribution to reconstructing any of the observed seismic traces. For example, if $${m}^{k}=0$$, then $${L}_{k}$$ (the k-th column of the transformation matrix L) plays no role in representing the observations D in the system (8). To avoid the possible breakdown of the matrix N in (17) and to account for this numerical behavior, we apply the Sherman-Morrison-Woodbury formula45 to $${N}^{-1}$$. Denote

$$G={(\frac{p}{2}H)}^{-1}=\frac{2}{p}{\rm{diag}}\{\frac{\parallel {m}^{1}{\parallel }_{2}^{2-p}}{{\alpha }_{1}},\frac{\parallel {m}^{2}{\parallel }_{2}^{2-p}}{{\alpha }_{2}},\cdots ,\frac{\parallel {m}^{K}{\parallel }_{2}^{2-p}}{{\alpha }_{K}}\},$$
(18)

then the formula (17) can be rewritten as

$$M={N}^{-1}{L}^{T}D=[G-G{L}^{T}{({I}_{m}+LG{L}^{T})}^{-1}LG]{L}^{T}D,$$
(19)

where $${I}_{m}$$ is the m-dimensional identity operator. If the matrices G and M are computed alternately according to equations (18) and (19), an iterative procedure follows naturally:

$$\begin{array}{l}{G}_{t}=\frac{2}{p}diag\{\frac{\parallel {m}_{t}^{1}{\parallel }_{2}^{2-p}}{{\alpha }_{1}},\frac{\parallel {m}_{t}^{2}{\parallel }_{2}^{2-p}}{{\alpha }_{2}},\cdots ,\frac{\parallel {m}_{t}^{K}{\parallel }_{2}^{2-p}}{{\alpha }_{K}}\},\\ {M}_{t+1}=[{G}_{t}-{G}_{t}{L}^{T}{({I}_{m}+L{G}_{t}{L}^{T})}^{-1}L{G}_{t}]{L}^{T}D.\end{array}$$
(20)
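The equivalence between the direct update (17) and the Sherman-Morrison-Woodbury form (19) can be verified numerically. In the sketch below, the function names and the random test data are assumptions made for illustration; the second part shows that the form (19) remains well defined when one row norm is exactly zero, where H in (14) is not.

```python
import numpy as np

def update_direct(L, D, row_norms, alpha, p):
    """Equation (17): M = (L^T L + (p/2) H)^{-1} L^T D; needs all row norms nonzero."""
    H = np.diag(alpha / row_norms ** (2.0 - p))
    return np.linalg.solve(L.T @ L + 0.5 * p * H, L.T @ D)

def update_woodbury(L, D, row_norms, alpha, p):
    """Equation (19): M = [G - G L^T (I_m + L G L^T)^{-1} L G] L^T D."""
    g = (2.0 / p) * row_norms ** (2.0 - p) / alpha   # diagonal of G, equation (18)
    GLt = g[:, None] * L.T                           # G L^T, shape (K, m)
    B = GLt @ D                                      # G L^T D
    S = np.eye(L.shape[0]) + L @ GLt                 # I_m + L G L^T
    return B - GLt @ np.linalg.solve(S, L @ B)

rng = np.random.default_rng(0)
m, K, l, p = 5, 12, 3, 0.5
L = rng.standard_normal((m, K))
D = rng.standard_normal((m, l))
alpha = np.full(K, 0.1)
row_norms = rng.uniform(0.5, 2.0, size=K)

M_direct = update_direct(L, D, row_norms, alpha, p)
M_woodbury = update_woodbury(L, D, row_norms, alpha, p)

# A zero row norm makes H undefined, but G simply gets a zero diagonal entry.
row_norms_with_zero = row_norms.copy()
row_norms_with_zero[2] = 0.0
M_zero_row = update_woodbury(L, D, row_norms_with_zero, alpha, p)
```

Only an m-by-m system is solved in the second form, which is convenient when m is much smaller than K.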

The iterative algorithm is outlined in Algorithm 1.

Algorithm 1. An iterative procedure for solving problem (16)

Step 1. Input $$L\in {R}^{m\times K}$$, $$D\in {R}^{m\times l}$$. Set the sparse parameter $$p\in (0,1]$$ and the diagonal matrix $${\rm{\Lambda }}=diag\{{\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{K}\}\,\succ \,0$$ (here $$\succ$$ denotes positive definiteness). Choose the stopping tolerance $$\epsilon > 0$$.

Step 2. Set t = 1 and initialize $${M}_{1}\in {R}^{K\times l}$$.

Step 3. For t = 1, 2, … until $${\rho }_{t}\le \epsilon$$ do:

$${G}_{t}=\frac{2}{p}diag\{\frac{\parallel {m}_{t}^{1}{\parallel }_{2}^{2-p}}{{\alpha }_{1}},\frac{\parallel {m}_{t}^{2}{\parallel }_{2}^{2-p}}{{\alpha }_{2}},\cdots ,\frac{\parallel {m}_{t}^{K}{\parallel }_{2}^{2-p}}{{\alpha }_{K}}\};$$
$${M}_{t+1}=[{G}_{t}-{G}_{t}{L}^{T}{({I}_{m}+L{G}_{t}{L}^{T})}^{-1}L{G}_{t}]{L}^{T}D;$$
$${\rho }_{t}=\frac{\parallel {M}_{t+1}-{M}_{t}{\parallel }_{F}}{\parallel {M}_{t}{\parallel }_{F}}.$$

Here $${m}_{t}^{k}$$ (k = 1, 2, …, K) denotes the k-th row vector of $${M}_{t}$$. Algorithm 1 aims to solve the fixed-point system (16), which is the stationarity equation of the matrix function (15). From the iterative procedure of Algorithm 1, the iterate $${M}_{t}$$ solves the nonlinear equation (16) if and only if $${M}_{t}=[{G}_{t}-{G}_{t}{L}^{T}{({I}_{m}+L{G}_{t}{L}^{T})}^{-1}L{G}_{t}]{L}^{T}D$$, which is equivalent to $${M}_{t}={M}_{t+1}$$. From this iteration on, the iterate no longer changes, which indicates that a stationary point has been found. Hence the stopping criterion of Algorithm 1 can be chosen as $${\rho }_{t}:=\frac{\parallel {M}_{t+1}-{M}_{t}{\parallel }_{F}}{\parallel {M}_{t}{\parallel }_{F}}\le \epsilon$$, where $$\parallel \cdot {\parallel }_{F}$$ stands for the Frobenius norm46.
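The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `algorithm1`, the iteration cap `max_iter`, the minimum-norm initial guess and the small synthetic row-sparse test problem are all assumptions.

```python
import numpy as np

def algorithm1(L, D, alpha, p=0.5, eps=1e-3, max_iter=200, M_init=None):
    """Iterate (20) until the relative change rho_t drops below eps."""
    m = L.shape[0]
    # Assumption: start from the minimum-norm least-squares solution unless given.
    M = np.linalg.pinv(L) @ D if M_init is None else np.array(M_init, dtype=float)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        g = (2.0 / p) * np.linalg.norm(M, axis=1) ** (2.0 - p) / alpha   # diag of G_t
        GLt = g[:, None] * L.T                                           # G_t L^T
        B = GLt @ D
        M_next = B - GLt @ np.linalg.solve(np.eye(m) + L @ GLt, L @ B)   # M_{t+1}
        rho = np.linalg.norm(M_next - M) / np.linalg.norm(M)             # Frobenius norms
        M = M_next
        if rho <= eps:
            break
    return M, n_iter

# Hypothetical test problem: three active rows shared by all five traces.
rng = np.random.default_rng(0)
m, K, l = 20, 40, 5
L = rng.standard_normal((m, K)) / np.sqrt(m)
M_true = np.zeros((K, l))
M_true[[3, 11, 27]] = rng.standard_normal((3, l))
D = L @ M_true
alpha = np.full(K, 1e-2)
M_init = np.linalg.pinv(L) @ D
M_rec, n_iter = algorithm1(L, D, alpha, p=0.5, M_init=M_init)
```

The update is written in the form (19), so rows that become exactly zero stay zero without any special handling.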

By the definition (12) of $$\parallel M{\parallel }_{2,p}$$, the sparse parameter $$p\in (0,1]$$ favors solutions of the $$l_{2,p}$$-regularized matrix minimization problem (11) with many zero row vectors. This means that many basis functions make no contribution to reconstructing the seismic wavefields, which accords with the prior knowledge. Therefore $${m}_{t}^{k}=0$$ may occur frequently during the iterations of Algorithm 1. We may formulate the following statement.

Remark. In Algorithm 1, if $${m}_{{t}_{0}}^{k}=0$$ occurs for some iterate $${M}_{{t}_{0}}$$, then $${m}_{t}^{k}=0$$ for all $$t\ge {t}_{0}$$.

The remark can be explained as follows. If $${m}_{{t}_{0}}^{k}=0$$ in the $${t}_{0}$$-th iteration, then the corresponding diagonal entry of $${G}_{{t}_{0}}$$ is zero, namely $${({G}_{{t}_{0}})}_{kk}=0$$. From the update formula $${M}_{{t}_{0}+1}={G}_{{t}_{0}}[{I}_{K}-{L}^{T}{({I}_{m}+L{G}_{{t}_{0}}{L}^{T})}^{-1}L{G}_{{t}_{0}}]{L}^{T}D$$, we know that $${m}_{{t}_{0}+1}^{k}=0$$ holds, and hence $${m}_{t}^{k}=0$$ for all $$t\ge {t}_{0}$$. Once $${m}_{{t}_{0}}^{k}=0$$, the k-th column of the matrix L is no longer needed in the linear system (8) or in the variational function J(M) in (15), so we can discard it to reduce the system without any loss. This improvement of Algorithm 1 is summarized as Algorithm 2.

Algorithm 2. Solving problem (16) for any p$$\in$$ (0, 1]

Step 1. Input $$L\in {R}^{m\times K}$$, $$D\in {R}^{m\times l}$$. Set the sparse parameter $$p\in (0,1]$$ and the diagonal matrix $${\rm{\Lambda }}=diag\{{\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{K}\}\,\succ \,0$$. Choose the stopping tolerance $$\epsilon > 0$$.

Step 2. Set t = 1 and initialize $${\hat{M}}_{1}\in {R}^{K\times l}$$. Let Ω0 = {1, 2, …, K}.

Step 3. For t = 1, 2, … until $${\rho }_{t}\le \epsilon$$ do:

$${{\rm{\Omega }}}_{t}={{\rm{\Omega }}}_{t-1}\setminus \{k:\,\parallel {\hat{m}}_{t}^{k}{\parallel }_{2}=0\};$$
$${M}_{t}={\hat{M}}_{t}({{\rm{\Omega }}}_{t};:),\,{L}_{t}=L(:;{{\rm{\Omega }}}_{t});$$
$${G}_{t}=\frac{2}{p}diag{\{\frac{\parallel {m}_{t}^{k}{\parallel }_{2}^{2-p}}{{\alpha }_{k}}\}}_{k\in {{\rm{\Omega }}}_{t}};$$
$${\hat{M}}_{t+1}=[{G}_{t}-{G}_{t}{L}_{t}^{T}{({I}_{m}+{L}_{t}{G}_{t}{L}_{t}^{T})}^{-1}{L}_{t}{G}_{t}]{L}_{t}^{T}D;$$
$${\rho }_{t}=\frac{\parallel {\hat{M}}_{t+1}-{M}_{t}{\parallel }_{F}}{\parallel {M}_{t}{\parallel }_{F}}.$$

In Algorithm 2, $${M}_{t}={\hat{M}}_{t}({{\rm{\Omega }}}_{t};:)$$ means keeping the rows of $${\hat{M}}_{t}$$ indexed by the set $${{\rm{\Omega }}}_{t}$$, while $${L}_{t}=L(:;{{\rm{\Omega }}}_{t})$$ keeps the columns of L indexed by $${{\rm{\Omega }}}_{t}$$. Compared with Algorithm 1, Algorithm 2 removes the zero rows of the approximate solution in each iteration, together with the corresponding columns of the basis matrix L. This technique iteratively shrinks the system by discarding the inactive rows.
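The index-set update of Algorithm 2 amounts to a row and column selection. A minimal sketch is given below; the helper name `prune_zero_rows`, the optional tolerance `tol` and the tiny example arrays are illustrative assumptions, and the boolean mask is taken relative to the current reduced system rather than the original numbering.

```python
import numpy as np

def prune_zero_rows(M_hat, L_t, omega, tol=0.0):
    """Drop rows of M_hat with ||m^k||_2 <= tol and the matching columns of L_t.

    omega holds the original (0-based) indices of the rows still active.
    """
    keep = np.linalg.norm(M_hat, axis=1) > tol
    return M_hat[keep], L_t[:, keep], omega[keep]

L = np.arange(12.0).reshape(3, 4)
M_hat = np.array([[1.0, 2.0],
                  [0.0, 0.0],
                  [0.5, 0.0],
                  [0.0, 0.0]])
omega = np.arange(4)

M_t, L_t, omega_t = prune_zero_rows(M_hat, L, omega)
```

Since zero rows contribute nothing to the product, `L_t @ M_t` reproduces `L @ M_hat` exactly, which is why the reduction loses no information.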

Based on the procedure of Algorithm 2, $${N}_{t}={L}_{t}^{T}{L}_{t}+\frac{p}{2}{H}_{t}$$ is well defined and $${\hat{M}}_{t+1}$$ is the solution of the linear system $${N}_{t}M={L}_{t}^{T}D$$. Since N t is symmetric and positive definite, $${\hat{M}}_{t+1}$$ is also the optimal matrix solution of the following quadratic subproblem

$$\mathop{\min }\limits_{M}\,{Q}_{t}(M):=Tr({({L}_{t}M-D)}^{T}({L}_{t}M-D))+\frac{p}{2}Tr({M}^{T}{H}_{t}M).$$
(21)

Since $${\hat{M}}_{t+1}$$ minimizes $${Q}_{t}$$, we have $${Q}_{t}({\hat{M}}_{t+1})\le {Q}_{t}({M}_{t})$$, which is equivalent to

$$\parallel {L}_{t}{\hat{M}}_{t+1}-D{\parallel }_{F}^{2}+\frac{p}{2}\sum _{k\in {{\rm{\Omega }}}_{t}}\frac{{\alpha }_{k}\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}^{2}}{\parallel {m}_{t}^{k}{\parallel }_{2}^{2-p}}\le \parallel {L}_{t}{M}_{t}-D{\parallel }_{F}^{2}+\frac{p}{2}\parallel {{\rm{\Lambda }}}_{t}{M}_{t}{\parallel }_{2,p}^{p}.$$
(22)

It is noticed that $$J({M}_{t})=\parallel {L}_{t}{M}_{t}-D{\parallel }_{F}^{2}+\parallel {{\rm{\Lambda }}}_{t}{M}_{t}{\parallel }_{2,p}^{p}$$ and $$J({M}_{t+1})=J({\hat{M}}_{t+1})$$. Using inequalities (A-2) (see the Appendix A) and (22), we can derive that

$$\begin{array}{l}J({M}_{t+1})=J({\hat{M}}_{t+1})\le J({M}_{t}),\,\,\,\,p\in (0,1],\end{array}$$
(23)

which means that the sequence $$\{J({M}_{t})\}$$ is non-increasing over the iterations for any $$p\in (0,1]$$.
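For orientation, the step from (22) to (23) can be sketched as follows, under the assumption that inequality (A-2) of Appendix A is the usual concavity estimate for $$s\mapsto {s}^{p/2}$$ (the appendix itself is not reproduced here). For nonzero $${m}_{t}^{k}$$ and $$p\in (0,1]$$, concavity gives

$$\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}^{p}\le \parallel {m}_{t}^{k}{\parallel }_{2}^{p}+\frac{p}{2}\,\frac{\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}^{2}-\parallel {m}_{t}^{k}{\parallel }_{2}^{2}}{\parallel {m}_{t}^{k}{\parallel }_{2}^{2-p}}=(1-\frac{p}{2})\parallel {m}_{t}^{k}{\parallel }_{2}^{p}+\frac{p}{2}\,\frac{\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}^{2}}{\parallel {m}_{t}^{k}{\parallel }_{2}^{2-p}}.$$

Multiplying by $${\alpha }_{k}$$, summing over $$k\in {{\rm{\Omega }}}_{t}$$, adding $$\parallel {L}_{t}{\hat{M}}_{t+1}-D{\parallel }_{F}^{2}$$ to both sides and then applying (22) yields

$$J({\hat{M}}_{t+1})\le \parallel {L}_{t}{M}_{t}-D{\parallel }_{F}^{2}+\frac{p}{2}\sum _{k\in {{\rm{\Omega }}}_{t}}{\alpha }_{k}\parallel {m}_{t}^{k}{\parallel }_{2}^{p}+(1-\frac{p}{2})\sum _{k\in {{\rm{\Omega }}}_{t}}{\alpha }_{k}\parallel {m}_{t}^{k}{\parallel }_{2}^{p}=J({M}_{t}),$$

where the penalty is written with the weights $${\alpha }_{k}$$ entering as in (13) and (14).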

Once $$J({M}_{t+1})=J({M}_{t})$$ happens for some t, equality holds simultaneously in (A-2) (see Appendix A) and (22). From Proposition 2 of Appendix A, we obtain $$\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}=\parallel {m}_{t}^{k}{\parallel }_{2}$$ for all $$k\in {{\rm{\Omega }}}_{t}$$. Thus $${G}_{t+1}={G}_{t}$$ and $${H}_{t+1}={H}_{t}$$, which implies that $${\hat{M}}_{t+1}$$ is a solution of equation (17). Since the objective sequence $$\{J({M}_{t})\}$$ is strictly decreasing (unless a stationary point is reached) and bounded below, any accumulation point of the sequence $$\{{M}_{t}\}$$ is a stationary point of problem (11). At the same time, the decrease of $$\{J({M}_{t})\}$$ measures the convergence precision of the matrix sequence $$\{{M}_{t}\}$$.

Once the nonzero set of the t-th iteration has been fixed, the subproblem (21) can be solved in a variety of ways, such as preconditioned conjugate gradient methods46, nonmonotone gradient descent methods47,48, and so on. The framework is summarized as Algorithm 3.

Algorithm 3. A unified algorithm for solving problem (16) for any p$$\in$$ (0, 1]

Step 1. Input $$L\in {R}^{m\times K}$$, $$D\in {R}^{m\times l}$$. Set the sparse parameter $$p\in (0,1]$$ and the diagonal matrix $${\rm{\Lambda }}=diag\{{\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{K}\}\,\succ \,0$$. Choose the stopping tolerance $$\epsilon > 0$$.

Step 2. Set t = 1 and initialize $${\hat{M}}_{1}\in {R}^{K\times l}$$. Let Ω0 = {1, 2, …, K}.

Step 3. For t = 1, 2, … until $${\rho }_{t}\le \epsilon$$ do:

$${{\rm{\Omega }}}_{t}={{\rm{\Omega }}}_{t-1}\setminus \{k|\parallel {\hat{m}}_{t}^{k}{\parallel }_{2}=0\};$$
$${M}_{t}={\hat{M}}_{t}({{\rm{\Omega }}}_{t};:);\,{L}_{t}=L(:;{{\rm{\Omega }}}_{t});$$
$${H}_{t}={\rm{diag}}{\{\frac{{\alpha }_{k}}{\parallel {m}_{t}^{k}{\parallel }_{2}^{2-p}}\}}_{k\in {{\rm{\Omega }}}_{t}};$$
$${N}_{t}={L}_{t}^{T}{L}_{t}+\frac{p}{2}{H}_{t};$$

Solve $${N}_{t}M={L}_{t}^{T}D$$ for the solution $${\hat{M}}_{t+1}$$;

$${\rho }_{t}=\frac{\parallel {\hat{M}}_{t+1}-{M}_{t}{\parallel }_{F}}{\parallel {M}_{t}{\parallel }_{F}}.$$
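A compact NumPy sketch of Algorithm 3 is given below, under several stated assumptions. The linear system $${N}_{t}M={L}_{t}^{T}D$$ is solved with a dense direct solver after a symmetric diagonal scaling by $$W={(\frac{p}{2}{H}_{t})}^{-1/2}$$, which is algebraically equivalent but better conditioned when some rows are tiny; the names `objective` and `algorithm3`, the initial guess, the optional threshold `zero_tol` and the test problem are all hypothetical. The objective values are recorded so that the monotone behavior (23) can be inspected.

```python
import numpy as np

def objective(L, D, M, alpha, p):
    """J(M) of (15), with the penalty written as sum_k alpha_k ||m^k||_2^p."""
    return np.linalg.norm(L @ M - D) ** 2 + np.sum(alpha * np.linalg.norm(M, axis=1) ** p)

def algorithm3(L, D, alpha, p=0.5, eps=1e-3, max_iter=200, zero_tol=0.0):
    K, l = L.shape[1], D.shape[1]
    omega = np.arange(K)                       # Omega_0 (0-based indices)
    M_hat = np.linalg.pinv(L) @ D              # assumed initial guess
    history = []
    for _ in range(max_iter):
        keep = np.linalg.norm(M_hat, axis=1) > zero_tol
        omega, M_t = omega[keep], M_hat[keep]  # Omega_t and M_t
        L_t, a_t = L[:, omega], alpha[omega]
        history.append(objective(L_t, D, M_t, a_t, p))
        # W^2 = ((p/2) H_t)^{-1}; solve (I + W L_t^T L_t W) Y = W L_t^T D, then M = W Y.
        w = np.sqrt((2.0 / p) * np.linalg.norm(M_t, axis=1) ** (2.0 - p) / a_t)
        LW = L_t * w
        Y = np.linalg.solve(np.eye(omega.size) + LW.T @ LW, LW.T @ D)
        M_hat = w[:, None] * Y                 # solves N_t M = L_t^T D
        rho = np.linalg.norm(M_hat - M_t) / np.linalg.norm(M_t)
        if rho <= eps:
            break
    M = np.zeros((K, l))
    M[omega] = M_hat                           # re-embed into the full K x l matrix
    history.append(objective(L, D, M, alpha, p))
    return M, history

rng = np.random.default_rng(0)
m, K, l = 20, 40, 5
L = rng.standard_normal((m, K)) / np.sqrt(m)
M_true = np.zeros((K, l))
M_true[[3, 11, 27]] = rng.standard_normal((3, l))
D = L @ M_true
alpha = np.full(K, 1e-2)
M_rec, history = algorithm3(L, D, alpha, p=0.5)
```

For large problems the dense solve would be replaced by an iterative method such as the preconditioned conjugate gradient method mentioned above; the direct solver is used here only to keep the sketch short.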

## Experimental results

To validate the efficiency of the joint matrix minimization approach and the unified algorithm for the problem (11), we perform three tests: (1) restoration of the input one-dimensional random signal with the randomly generated matrix L; (2) restoration of the synthetic seismic data with random loss of traces; (3) restoration of the field data.

### One-dimensional signal reconstruction

We randomly take samples to generate the matrix L. For implementation, we try to restore the signal by the model (11) with q = 2 and p$$\in$$ (0, 1].

The stopping tolerance in Algorithm 3 is set to $$\epsilon ={10}^{-3}$$. The sparse parameter p and the regularization parameters $${\alpha }_{k}$$ are typically chosen in (0, 1]. Results for other values of p are similar. The relative error of the recovered signal $${M}_{{\rm{rec}}}$$ with respect to the true (given) signal $${M}_{{\rm{true}}}$$ is defined by

$${{\rm{err}}}_{{\rm{rel}}}=\frac{\parallel {M}_{{\rm{rec}}}-{M}_{{\rm{true}}}{\parallel }_{2}}{\parallel {M}_{{\rm{true}}}{\parallel }_{2}}.$$

To quantify the results, we define the signal-to-noise ratio (SNR) as $${\rm{SNR}}=10{\mathrm{log}}_{10}\frac{\parallel {d}_{{\rm{org}}}{\parallel }_{2}^{2}}{\parallel {d}_{{\rm{org}}}-{d}_{{\rm{rec}}}{\parallel }_{2}^{2}}$$, where $${d}_{{\rm{org}}}$$ refers to the original data and $${d}_{{\rm{rec}}}$$ is the restored data.
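The two quality measures can be computed as in the following sketch. The function names are illustrative, and for matrix-valued data the entrywise (Frobenius) norm is assumed, which is what `np.linalg.norm` returns by default for 2-D arrays.

```python
import numpy as np

def relative_error(M_rec, M_true):
    """err_rel = ||M_rec - M_true|| / ||M_true||."""
    return np.linalg.norm(M_rec - M_true) / np.linalg.norm(M_true)

def snr_db(d_org, d_rec):
    """SNR = 10 log10(||d_org||^2 / ||d_org - d_rec||^2), in dB."""
    return 10.0 * np.log10(np.sum(d_org ** 2) / np.sum((d_org - d_rec) ** 2))

# Toy example: a reconstruction that is uniformly 10% too small.
d_org = np.ones(100)
d_rec = 0.9 * d_org
err = relative_error(d_rec, d_org)
snr = snr_db(d_org, d_rec)
```

When both measures are evaluated on the same pair of arrays they carry the same information, since SNR = −20 log10(err_rel).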

For the one-dimensional case, the matrix M reduces to a vector, so the unified Algorithm 3 can be used to solve (11). For comparison, we also apply the spectral projected gradient (SPG) method49 to solve the $$l_{1}$$-regularization problem. The code of SPG is downloaded from http://www.cs.ubc.ca/~mpf/spgl1/index.html. The two algorithms are run in the same environment, each with its best regularization parameters. The comparison items include the $${{\rm{err}}}_{{\rm{rel}}}$$ value, the SNR and the CPU running time (in seconds). Each experiment is repeated five times and the average values are reported in Table 1. The results indicate that both methods perform well for the one-dimensional signal reconstruction problem.

Apart from the regular data, we also consider noisy cases to show the robustness of the two methods. Different noise levels are added to the simulated data. Noise level 0.001 means that the noise is randomly generated with zero mean and variance 0.001. The results of Algorithm 3 with sparse parameters p = 1 and p = 0.5 are displayed in Table 1. Compared with the $$l_{1}$$-regularized minimization model, the half-norm (p = 0.5) regularized minimization performs better in reconstruction. Figure 2 plots the recovery performance of Algorithm 3 with p = 0.5 on noisy data. Figure 2(a) compares the real signal with the recovered signal, and Fig. 2(b) illustrates the difference between the recovered signal and the input (true) signal. The recovery images of the other cases are similar. The figures reveal that our model and algorithm perform well for the one-dimensional seismic wavefield reconstruction problem even in noisy cases.

### Reconstruction of seismograms from a layered model

Now we consider a seismogram generated from a seven-layer geologic velocity model, where the spatial sampling interval is 15 meters and the time sampling interval is 0.002 second. The velocity varies from 2500 m/s to 5500 m/s. The seismogram is generated using a source function given by a Ricker wavelet with a central frequency of 25 Hz. The dataset contains 256 traces with 256 time samples in each trace. Different percentages of missing traces in the original data, 10%, 25% and 50%, are used to test the limits of the recovery methods. The joint matrix model (11) with Algorithm 3 is applied to reconstruct the seismic wavefield. Since the spectral projected gradient method solves only an $$l_{1}$$-regularized vector minimization problem, we decompose the matrix representation system (11) into $$l_{1}$$-regularized vector minimization problems. Each column is treated as a subproblem and its weight vector is reconstructed separately. The solutions of the subproblems are then aligned sequentially into a weight matrix to evaluate the reconstruction performance. The experimental results for missing percentages of 10% and 25% can be seen in Tables 2 and 3.

For the noise-free data with 50% of the traces missing, the reconstruction performance of the joint matrix model with Algorithm 3 is much worse than for missing percentages of 10% and 25%. The $${{\rm{err}}}_{{\rm{rel}}}$$ value is 0.5414 and the SNR is around 5.1904 dB, almost the same for any $$p\in (0,1]$$. These results mean that our method may not recover the seismic wavefield well if 50% or more of the traces are missing. Indeed, sub-sampled data missing 50% of the traces is itself a failed collection of seismic records.

The original shot gathers are shown in Fig. 3(a). The data with 25% of the traces missing are shown in Fig. 3(b). In forming the under-determined matrix L, a Haar wavelet orthogonal basis is used to form the transform matrix Ψ. The unified Algorithm 3 is applied to solve the joint matrix minimization problem (11) with q = 2 and typical parameters $$p\in (0,1]$$. Good recovery performance is observed and the result is shown in Fig. 3(c). The difference between the original and the recovered data, shown in Fig. 3(d), illustrates the efficient recovery performance of the joint matrix minimization approach. In displaying the results, the amplitude scale of the error map is the same as the amplitude scale of the data. Other values of the sparse parameter p can of course be chosen; the visual results are similar, so we only list the quantitative results in Tables 2 and 3.
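How an under-determined matrix L = AΨ might be assembled from a Haar basis can be sketched as follows. The text does not specify the sampling operator, so the sketch assumes that A simply selects the retained samples (rows of the identity); the helper `haar_matrix`, the random sampling pattern and the test signal are illustrative only.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix of size n x n (n a power of two); rows are basis vectors."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        k = H.shape[0]
        top = np.kron(H, np.array([[1.0, 1.0]]))              # averaging part
        bottom = np.kron(np.eye(k), np.array([[1.0, -1.0]]))  # differencing part
        H = np.vstack([top, bottom]) / np.sqrt(2.0)
    return H

rng = np.random.default_rng(0)
n = 256
Psi = haar_matrix(n).T                       # columns are the basis vectors Psi_k

# Assumed sampling operator: keep 192 of 256 samples (25% removed at random).
keep = np.sort(rng.choice(n, size=192, replace=False))
A = np.eye(n)[keep]
L = A @ Psi                                  # L = A Psi, shape (192, 256)

x = rng.standard_normal(n)                   # a hypothetical full signal
coeff = Psi.T @ x                            # m = Psi^T x, coefficients in the Haar basis
d = A @ x                                    # sub-sampled observation
```

With this construction, `L @ coeff` reproduces the sub-sampled data `d`, which is the relation (3) for a single signal.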

For comparison, the recovery image of the SPG algorithm for the case of 25% of the traces missing is presented in Fig. 4. Figure 4(a) is the reconstruction and Fig. 4(b) displays the difference between the original and reconstructed seismic signals. It is noticed that the SPG algorithm for the $$l_{1}$$-regularized vector minimization restores the seismic wavefield as accurately as the joint matrix approach with Algorithm 3. These results are obtained using the same code from http://www.cs.ubc.ca/~mpf/spgl1/index.html.

To show the anti-noise property of our algorithm, we add random noise with noise level 0.001 to the simulated data. The unified Algorithm 3 is applied to solve the joint matrix minimization problems. The $${{\rm{err}}}_{{\rm{rel}}}$$ value, the SNR and the CPU running time (in seconds) are listed in Table 2 for three sparse parameters. The recovery image and the difference between the original and the recovered data are shown in Fig. 5(a and b) respectively. The low relative error and high SNR indicate that our algorithm is stable for seismic data restoration.

To reduce the memory requirement for large-scale data, we have examined the restoration behavior of our method on patches of the input synthetic data. We evenly partition the collection of trace signals D into several blocks, $$D=[{D}_{1},{D}_{2},\ldots ,{D}_{f}]$$, where $${D}_{g}\in {R}^{m\times {l}_{g}}$$ and $$\sum _{g=1}^{f}{l}_{g}=l$$. Each $${D}_{g}$$ is input separately to recover the seismic signals by system (11). Then all the sub-solutions $${M}_{g}$$, g = 1, 2, …, f, are combined into $$M=[{M}_{1},{M}_{2},\ldots ,{M}_{f}]$$. When the number of segments is two or three, the recovered $${{\rm{err}}}_{{\rm{rel}}}$$ values and SNR are almost the same as in the integral (unpartitioned) case. When each column is treated as a segment, the joint matrix model reduces to a sequence of vector recoveries; the $${{\rm{err}}}_{{\rm{rel}}}$$ values and SNR are similar to the integral case, but the computational time is around 50 times longer.
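The block-wise procedure described above can be sketched in Python as follows. This is a minimal illustration only: `recover_in_blocks` and `placeholder_recover` are hypothetical names, and the placeholder simply copies its input instead of solving system (11), so that the splitting and recombination logic can be seen in isolation.

```python
import numpy as np


def recover_in_blocks(D, f, recover):
    """Split the columns of D into f blocks, recover each, and recombine."""
    blocks = np.array_split(D, f, axis=1)         # D = [D_1, ..., D_f]
    solutions = [recover(D_g) for D_g in blocks]  # one sub-solution M_g per block
    return np.hstack(solutions)                   # M = [M_1, ..., M_f]


def placeholder_recover(D_g):
    # Hypothetical stand-in for solving system (11) on a single block.
    return D_g.copy()


D = np.arange(24.0).reshape(4, 6)
M = recover_in_blocks(D, f=3, recover=placeholder_recover)
widths = [block.shape[1] for block in np.array_split(D, 3, axis=1)]
```

Setting `f` equal to the number of columns gives the column-by-column case, in which each call to the solver handles a single vector.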

### Reconstruction of seismograms from a heterogeneous model

Next we consider a seismogram generated from a velocity model varying both vertically and transversely (Wang et al.5). The original seismic wavefield, the sub-sampled data (37% of the traces are randomly removed) and the recovered data are shown in Fig. 6(a–c), respectively. The difference between the original data and the recovered data is illustrated in Fig. 6(d). In displaying the results, the amplitude scale of the error map is the same as the amplitude scale of the data. The figure illustrates that all the initial seismic energy is recovered with minor errors. Though the reconstruction is not perfect, most of the details of the wavefield are preserved. Again, to test the quality of our algorithm in seismic data restoration for complex structures, we calculate the signal-to-noise ratio and the relative error. For p = 0.5, the values of SNR and $${{\rm{err}}}_{{\rm{rel}}}$$ are 26.9792 and 0.0448, respectively; for p = 1, they are 25.6940 and 0.0519, respectively. The high SNR and low $${{\rm{err}}}_{{\rm{rel}}}$$ indicate that our algorithm works for seismic data restoration even with complex structures.
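The two quality measures quoted above can be computed as in the following Python sketch. It assumes the common definitions of the relative error as the Frobenius-norm ratio ‖M − M̂‖/‖M‖ and of the SNR as −20 log10 of that ratio, in dB. These formulas are an assumption in this sketch, although the reported pairs (for example, 0.0448 and 26.9792) are consistent with this relation up to rounding of the relative error. The function names are hypothetical.

```python
import numpy as np


def relative_error(original, recovered):
    """Frobenius-norm relative error (assumed definition)."""
    return np.linalg.norm(original - recovered) / np.linalg.norm(original)


def snr_db(original, recovered):
    """SNR in dB, assumed to be -20*log10 of the relative error."""
    return -20.0 * np.log10(relative_error(original, recovered))


# Consistency check against one reported pair (relative error 0.0448).
snr_from_reported_error = -20.0 * np.log10(0.0448)
```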

To show the robustness of our algorithm to interference, we add random noise with levels 0.001 and 0.01, respectively, to the simulated data. The unified Algorithm 3 with p = 0.5 is applied to solve the joint matrix minimization problems. The values of SNR and $${{\rm{err}}}_{{\rm{rel}}}$$ are 26.9074 and 0.0451 for noise level 0.001, and 18.0355 and 0.1254 for noise level 0.01.

For the noisy case with noise level 0.01, the frequency information of the original data, the sub-sampled data and the recovered data is shown in Fig. 7(a–c), respectively. Again, the aliasing of the sub-sampled data is greatly reduced in the recovered data.

### Field data

Finally, we examine the effectiveness of the new method with field data. The seismic data is a marine shot gather, shown in Fig. 8(a), which consists of 256 traces with a spacing of 25 m and a time sampling interval of 2 ms. There are damaged traces in the original gather. The sub-sampled gather, with 42% of the original traces randomly removed, is shown in Fig. 8(b). This sub-sampled gather is used to restore the original gather. Again, the unified Algorithm 3 is applied to solve the joint matrix minimization problem (11) with q = 2 and p = 0.5. The recovery result is shown in Fig. 8(c). The difference between the original and the recovered data, shown in Fig. 8(d), illustrates the effective recovery performance of the joint matrix minimization approach. In displaying the results, the amplitude scale of the error map is the same as the amplitude scale of the data. Compared with the sub-sampled image, the restored image reconstructs most of the details of the original image. In addition, the damaged trace in the original gather is restored as a good trace. Using the same definition of SNR as above, the SNR equals 19.7301 for p = 0.5 and 19.7919 for p = 1. We only show figures for p = 1, since the results for p = 0.5 are visually similar.

The frequency information of the original data, the sub-sampled data and the recovered data is shown in Fig. 9(a–c), respectively. This indicates that the aliasing of the sub-sampled data is greatly reduced in the recovered data.

## Conclusion

Sparse optimization has broad applications in seismic data processing. In this paper we focus on the data restoration problem. Noting that the seismic wavefield can be represented using a matrix instead of a vector as the weight variable, so that all the signals are expressed simultaneously, we propose a matrix optimization model for seismic wavefield recovery. We first reformulate the data restoration problem as an l2, p-norm constrained matrix minimization model for any p$$\in$$ (0, 1], which is a nonconvex and non-Lipschitz continuous minimization problem. Then we develop a unified algorithm to solve the mixed matrix optimization problem for any p$$\in$$ (0, 1]. Convergence analysis of the new algorithm is also addressed. Numerical results on synthetic problems and the field data example indicate the potential of our method for practical applications.

### Appendix Properties of the new algorithms

In this section, we will analyze the convergence property of the Algorithm 2. The main result is that the objective function J(M t ) strictly decreases with respect to iterations until the matrix sequence {M t } converges to a stationary point of J(M).

Proposition 1. Let $$\phi (\tau )=\tau -a{\tau }^{\frac{1}{a}}$$ be a function of the variable τ, where a$$\in$$ (0, 1). Then for any τ > 0, φ(τ) ≤ 1−a, and τ = 1 is the unique maximizer.

To verify the above statements, let us take the derivative of φ(τ) and set it to be zero, that is

$$\phi ^{\prime} (\tau )=1-{\tau }^{\frac{1}{a}-1}=0,$$

then φ′(τ) = 0 has the unique solution τ = 1 for any a$$\in$$ (0, 1) which is just the maximizer of φ(τ) in (0,  +∞).
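The stationarity condition alone does not show that τ = 1 is a maximizer. A short second-derivative check, added here as a supplementary step, closes the argument:

$$\phi ^{\prime\prime} (\tau )=-\left(\frac{1}{a}-1\right){\tau }^{\frac{1}{a}-2} < 0\quad {\rm{for}}\,{\rm{all}}\,\tau > 0,\,a\in (0,1),$$

so φ is strictly concave on (0,  +∞), its unique stationary point τ = 1 is the global maximizer, and the maximum value is φ(1) = 1 − a.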

Based on Proposition 1, for a given a$$\in$$ (0, 1),

$$\tau -a{\tau }^{\frac{1}{a}}\le 1-a$$
(A-1)

holds for τ$$\in$$ (0,  +∞), with equality if and only if τ = 1. Taking the special value $$a=\frac{p}{2}\,(p\in (0,1])$$, the inequality (A-1) yields the following result associated with $$||M|{|}_{2,p}^{p}(0 < p\le 1)$$.

Proposition 2. Suppose that $${M}_{t}$$ and $${\hat{M}}_{t+1}$$ are generated in the t-th iteration by Algorithm 2. Then the following inequality holds:

$$\parallel {{\rm{\Lambda }}}_{t}{\hat{M}}_{t+1}{\parallel }_{2,p}^{p}-\frac{p}{2}\sum _{k\in {{\rm{\Omega }}}_{t}}\frac{{\alpha }_{k}\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}^{2}}{\parallel {m}_{t}^{k}{\parallel }_{2}^{2-p}}\le (1-\frac{p}{2})\parallel {{\rm{\Lambda }}}_{t}{M}_{t}{\parallel }_{2,p}^{p},\,p\in (0,1],$$
(A-2)

where $${{\rm{\Lambda }}}_{t}=diag{\{{\alpha }_{k}\}}_{k\in {{\rm{\Omega }}}_{t}}$$. Moreover, the equality in (A-2) holds if and only if $$\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}=\parallel {m}_{t}^{k}{\parallel }_{2}$$ for k$$\in$$ Ω t .

Consider the approximate value $${M}_{t}$$. Since $${\Vert {m}_{t}^{k}\Vert }_{2}\ne 0$$ for each k$$\in$$ Ω t , we can set $$\tau =\frac{{\Vert {\hat{m}}_{t+1}^{k}\Vert }_{2}^{p}}{{\Vert {m}_{t}^{k}\Vert }_{2}^{p}}$$ and $$a=\frac{p}{2}$$ in Proposition 1. It follows that

$$\frac{\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}^{p}}{\parallel {m}_{t}^{k}{\parallel }_{2}^{p}}-\frac{p}{2}\frac{\parallel {\hat{m}}_{t+1}^{k}{\parallel }_{2}^{2}}{\parallel {m}_{t}^{k}{\parallel }_{2}^{2}}\le 1-\frac{p}{2},\quad k\in {{\rm{\Omega }}}_{t}\,.$$
(A-3)

Multiplying inequality (A-3) by $${\alpha }_{k}\parallel {m}_{t}^{k}{\parallel }_{2}^{p}$$, we obtain the following inequality

$${\alpha }_{k}{\Vert {\hat{m}}_{t+1}^{k}\Vert }_{2}^{p}-\frac{p}{2}\frac{{\alpha }_{k}{\Vert {\hat{m}}_{t+1}^{k}\Vert }_{2}^{2}}{{\Vert {m}_{t}^{k}\Vert }_{2}^{2-p}}\le (1-\frac{p}{2}){\alpha }_{k}{\Vert {m}_{t}^{k}\Vert }_{2}^{p},\quad k\in {{\rm{\Omega }}}_{t}.$$
(A-4)

Summing (A-4) over k$$\in$$ Ω t , we arrive at (A-2).
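The row-wise inequality (A-4) and its sum over k can also be checked numerically. The Python sketch below does so on randomly generated row norms and weights; the data, the sample size and the value p = 0.5 are hypothetical, and the weights α_k are assumed to be nonnegative so that multiplying (A-3) by them preserves the direction of the inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
n_rows = 1000

# Hypothetical test data: nonzero old row norms, arbitrary new row norms,
# and nonnegative weights alpha_k (an assumption of this sketch).
norm_old = rng.uniform(0.1, 5.0, n_rows)  # ||m_t^k||_2
norm_new = rng.uniform(0.0, 5.0, n_rows)  # ||m_hat_{t+1}^k||_2
alpha = rng.uniform(0.0, 2.0, n_rows)

# Left- and right-hand sides of (A-4) for every row k.
lhs = alpha * norm_new**p - 0.5 * p * alpha * norm_new**2 / norm_old**(2 - p)
rhs = (1 - 0.5 * p) * alpha * norm_old**p

rowwise_ok = bool(np.all(lhs <= rhs + 1e-12))     # (A-4) holds for each k
summed_ok = bool(lhs.sum() <= rhs.sum() + 1e-12)  # (A-4) summed over k
```

A numerical check of this kind does not replace the proof; it only guards against sign or exponent slips when the inequality is transcribed into code.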

Based on Proposition 1, τ = 1 is the unique maximizer of φ(τ) in (0,  +∞) when $$a=\frac{p}{2}$$. Namely, $${\Vert {\hat{m}}_{t+1}^{k}\Vert }_{2}={\Vert {m}_{t}^{k}\Vert }_{2}(k\in {{\rm{\Omega }}}_{t})$$ is necessary and sufficient for equality to hold in (A-4). Now, we can establish the following convergence property of Algorithm 2.

Proposition 3. Suppose that {M t } is the matrix sequence generated by Algorithm 2. Then J(M t ) strictly decreases with respect to t for any 0 < p ≤ 1 until {M t } converges to a stationary point of J(M).

Proposition 4. Based on the derivation of Proposition 3, so long as the subproblem (16) is solved with $${Q}_{t}({\hat{M}}_{t+1})\le {Q}_{t}({M}_{t})$$, the convergence of Algorithm 3 will be guaranteed for any p$$\in$$ (0, 1].