Compressed sensing and best $k$-term approximation

By Albert Cohen, Wolfgang Dahmen, and Ronald DeVore

Abstract

Compressed sensing is a new concept in signal processing where one seeks to minimize the number of measurements to be taken from signals while still retaining the information necessary to approximate them well. The ideas have their origins in certain abstract results from functional analysis and approximation theory by Kashin but were recently brought into the forefront by the work of Candès, Romberg, and Tao and of Donoho who constructed concrete algorithms and showed their promise in application. There remain several fundamental questions on both the theoretical and practical sides of compressed sensing. This paper is primarily concerned with one of these theoretical issues revolving around just how well compressed sensing can approximate a given signal from a given budget of fixed linear measurements, as compared to adaptive linear measurements. More precisely, we consider discrete signals $x\in \mathbb{R}^N$, allocate $n<N$ linear measurements of $x$, and we describe the range of $k$ for which these measurements encode enough information to recover $x$ in the sense of $\ell _p$ to the accuracy of best $k$-term approximation. We also consider the problem of having such accuracy only with high probability.

1. Introduction

The typical paradigm for obtaining a compressed version of a discrete signal represented by a vector $x\in \mathbb{R}^N$ is to choose an appropriate basis, compute the coefficients of $x$ in this basis, and then retain only the $k$ largest of these with $k<N$. If we are interested in a bit stream representation, we also need to quantize these $k$ coefficients.

Assuming, without loss of generality, that $x$ already represents the coefficients of the signal in the appropriate basis, this means that we pick an approximation to $x$ in the set $\Sigma _k$ of $k$-sparse vectors

$$\begin{equation} \Sigma _k:=\{x\in \mathbb{R}^N\; :\; \#\operatorname {supp}(x)\le k\}, \cssId{sigmak}{\tag{1.1}} \end{equation}$$

where $\operatorname {supp}(x)$ is the support of $x$, i.e., the set of $i$ for which $x_i\neq 0$, and $\#A$ is the number of elements in the set $A$. The best performance that we can achieve by such an approximation process in some given norm $\|\cdot \|_X$ of interest is described by the best $k$-term approximation error

$$\begin{equation} \sigma _k(x)_X:=\inf _{z\in \Sigma _k}\|x-z\|_X. \cssId{sigmaerr}{\tag{1.2}} \end{equation}$$
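For the $\ell _p$ norms, the infimum in (Equation 1.2) is attained by keeping the $k$ entries of largest magnitude (this is recalled in §3), so $\sigma _k(x)_{\ell _p}$ is simply the $\ell _p$ norm of the discarded entries. A minimal Python sketch of this special case follows; the function name `sigma_k` and the sample vector are illustrative choices, not notation from the paper.

```python
import math

def sigma_k(x, k, p):
    """Best k-term approximation error of x in the l_p norm (0 < p <= inf).

    Assumes X = l_p, where a best k-term approximation keeps the k
    largest entries in absolute value and sets the others to zero.
    """
    # Magnitudes of the N - k smallest entries, i.e. the discarded tail.
    tail = sorted((abs(v) for v in x), reverse=True)[k:]
    if not tail:
        return 0.0
    if p == math.inf:
        return float(max(tail))
    return sum(t ** p for t in tail) ** (1.0 / p)

x = [3.0, -1.0, 0.5, 2.0]
print(sigma_k(x, 2, 1))  # l_1 norm of the discarded entries (1.0, 0.5)
```

Only a sort is needed here because the support of the approximant is chosen adaptively from $x$ itself, which is exactly the adaptivity discussed in the text.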

This approximation process should be considered as adaptive since the indices of those coefficients which are retained vary from one signal to another. On the other hand, this procedure is stressed on the front end by the need to first compute all of the basis coefficients. The view expressed by Candès, Romberg, and Tao Reference 5, Reference 3, Reference 4 and Donoho Reference 8 is that since we retain only a few of these coefficients in the end, perhaps it is possible to actually compute only a few nonadaptive linear measurements in the first place and still retain the necessary information about $x$ in order to build a compressed representation.

These ideas have given rise to a very lively area of research called compressed sensing which poses many intriguing questions, of both a theoretical and practical flavor. The present paper is an excursion into this area, focusing our interest on the question of just how well compressed sensing can perform in comparison to best $k$-term approximation.

To formulate the problem, we are given a budget of $n$ questions we can ask about $x$. These questions are required to take the form of asking for the values $\lambda _1(x),\dots ,\lambda _n(x)$ where the $\lambda _j$ are fixed linear functionals. The information we gather about $x$ can therefore be described by

$$\begin{equation} y=\Phi x, \cssId{info1}{\tag{1.3}} \end{equation}$$

where $\Phi$ is an $n\times N$ matrix called the encoder and $y\in \mathbb{R}^n$ is the information vector. The rows of $\Phi$ are representations of the linear functionals $\lambda _j$, $j=1,\dots ,n$.

To extract the information that $y$ holds about $x$, we use a decoder $\Delta$ which is a mapping from $\mathbb{R}^n$ to $\mathbb{R}^N$. We emphasize that $\Delta$ is not required to be linear. Thus, $\Delta (y)=\Delta (\Phi x)$ is our approximation to $x$ from the information we have retained. We shall denote by ${\mathcal{A}}_{n,N}$ the set of all encoder-decoder pairs $(\Phi ,\Delta )$ with $\Phi$ an $n\times N$ matrix.

There are two common ways to evaluate the performance of an encoding-decoding pair $(\Phi ,\Delta )\in {\mathcal{A}}_{n,N}$. The first is to ask for the largest value of $k$ such that the encoding-decoding is exact for all $k$-sparse vectors, i.e.,

$$\begin{equation} x\in \Sigma _k\Rightarrow \Delta (\Phi x)=x. \cssId{exact}{\tag{1.4}} \end{equation}$$

It is easy to see (see §3) that given $n,N$, there are $(\Phi ,\Delta )\in {\mathcal{A}}_{n,N}$ such that (Equation 1.4) holds for all $k\le n/2$. Put another way, given $k$, we can achieve exact recovery on $\Sigma _k$ whenever $n\ge 2k$. Unfortunately, such encoder-decoder pairs are not numerically friendly, as is explained in §3.

Generally speaking, our signal will not be in $\Sigma _k$ with $k$ small but may be approximated well by the elements in $\Sigma _k$. Therefore, we would like our algorithms to perform well in this case as well. One way of comparing compressed sensing with best $k$-term approximation is to consider their respective performance on a specific class of vectors $K\subset \mathbb{R}^N$. For such a class we can define on the one hand

$$\begin{equation} \sigma _k(K)_X:=\sup _{x\in K}\sigma _k(x)_X \cssId{bestkK}{\tag{1.5}} \end{equation}$$

and

$$\begin{equation} E_n(K)_X:=\inf _{(\Phi ,\Delta )\in {\mathcal{A}}_{n,N}} \sup _{x\in K} \|x-\Delta (\Phi x)\|_X \cssId{csK}{\tag{1.6}} \end{equation}$$

which describe, respectively, the performance of the two methods over this class. We are now interested in the largest value of $k$ such that $E_n(K)_X\le C_0\sigma _k(K)_X$ for a constant $C_0$ independent of the parameters $k,n,N$. Results of this type were established already in the 1970s under the umbrella of what is called $n$-widths. The deepest results of this type were given by Kashin Reference 14 with later improvements by Garnaev and Gluskin Reference 9, Reference 13. We recall this well-known story briefly in §2.

The results on $n$-widths referred to above give matching upper and lower estimates for $E_n(K)_X$ in the case that $K$ is a typical sparsity class such as a ball in $\ell _p^N$ where

$$\begin{equation} \|x\|_{\ell _p}:=\|x\|_{\ell _p^N}:=\left\{\begin{array}{ll} \left(\sum _{j=1}^N|x_j|^p \right)^{1/p}, &0 < p<\infty ,\\\max _{j=1,\dots ,N} |x_j|,& p=\infty . \end{array}\right. \cssId{lp}{\tag{1.7}} \end{equation}$$

This in turn determines the largest range of $k$ for which we can obtain comparisons of the form $E_n(K)_X\le C_0\sigma _k(K)_X$. One such result is the following: for $K=U(\ell _1^N)$, $X=\ell _2^N$, one has

$$\begin{equation} E_n(U(\ell _1^N))_{\ell _2^N}\le C_0\sigma _k(U(\ell _1^N))_{\ell _2^N} \cssId{l11}{\tag{1.8}} \end{equation}$$

whenever

$$\begin{equation} k\le c_0n/\log (N/n) \cssId{l12}{\tag{1.9}} \end{equation}$$

with absolute constants $C_0,c_0$.

The decoders used in proving these theoretical bounds are far from being practical or numerically implementable. One of the remarkable achievements of the recent work of Candès, Romberg and Tao Reference 3 and Donoho Reference 8 is to give probabilistic constructions of matrices $\Phi$ which provide these bounds where the decoding can be done by solving the $\ell _1$ minimization problem

$$\begin{equation} \Delta (y):=\operatorname *{Argmin}_{\Phi z=y} \|z\|_{\ell _1}. \cssId{l1}{\tag{1.10}} \end{equation}$$
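One reason (Equation 1.10) is regarded as numerically tractable is that it can be recast as a linear program. The standard reformulation is recorded here for illustration; the auxiliary variable $t\in \mathbb{R}^N$ is not notation from the paper:

$$\min _{z,t\in \mathbb{R}^N}\ \sum _{j=1}^N t_j \quad \text{subject to}\quad \Phi z=y,\quad -t_j\le z_j\le t_j,\quad j=1,\dots ,N.$$

At any minimizer one has $t_j=|z_j|$ for all $j$, so the $z$-component of a solution is a minimizer in (Equation 1.10). The minimizer need not be unique for a general $\Phi$, in which case $\operatorname{Argmin}$ should be read as a selection among the minimizers.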

The above results on approximation of classes are governed by the worst elements in the class. It is a more subtle problem to obtain estimates that depend on the individual characteristics of the target vector $x$. The main contribution of the present paper is to study a stronger way to compare the performance of $k$-term approximation in a compressed sensing algorithm. Namely, we address the following question:

For a given norm $\|\cdot \|_X$ and $k<N$, what is the minimal value of $n$ for which there exists a pair $(\Phi ,\Delta )\in {\mathcal{A}}_{n,N}$ such that

$$\begin{equation} \|x-\Delta (\Phi x)\|_{X} \leq C_0 \sigma _k(x)_X, \cssId{inst}{\tag{1.11}} \end{equation}$$

for all $x\in \mathbb{R}^N$, with $C_0$ a constant independent of $k$ and $N$?

If a result of the form (Equation 1.11) has been established, then one can derive a result for a class $K$ by simply taking the supremum over all $x\in K$. However, results on classes are less precise and informative than (Equation 1.11).

We shall say a pair $(\Phi ,\Delta )\in {\mathcal{A}}_{n,N}$ satisfying (Equation 1.11) is instance optimal of order $k$ with constant $C_0$ for the space $X$. In particular, we want to understand under what circumstances the minimal value of $n$ is roughly of the same order as $k$, similar to (Equation 1.9). We shall see that the answer to this question strongly depends on the norm $X$ under consideration.

The approximation accuracy of a compressed sensing matrix is determined by the null space

$$\begin{equation} {\mathcal{N}}={\mathcal{N}}(\Phi ):=\{x\in \mathbb{R}^N: \Phi x=0\}. \cssId{nullspace}{\tag{1.12}} \end{equation}$$

The importance of ${\mathcal{N}}$ is that if we observe $y=\Phi x$ without any a priori information on $x$, the set of $z$ such that $\Phi z=y$ is given by the affine space

$$\begin{equation} {\mathcal{F}}(y):=x+{\mathcal{N}}. \cssId{affine}{\tag{1.13}} \end{equation}$$

We bring out the importance of the null space in §3 where we formulate a property of the null space which is necessary and sufficient for $\Phi$ to have a decoder $\Delta$ for which the instance optimality (Equation 1.11) holds.

We apply this property in §4 to the case $X=\ell _1$. In this case, we show the minimal number of measurements $n$ which ensures (Equation 1.11) is of the same order as $k$ up to a logarithmic factor. In that sense, compressed sensing performs almost as well as best $k$-term approximation. We also show that, similar to the work of Candès, Romberg, and Tao, this is achieved with the decoder $\Delta$ defined by $\ell _1$ minimization. We should mention that our results in this section are essentially contained in the work of Candès, Romberg, and Tao Reference 5, Reference 6, Reference 4 and we build on their ideas.

We next treat the case $X=\ell _2$ in §5. In this case, the situation is much less in favor of compressed sensing, since the minimal number of measurements $n$ which ensures (Equation 1.11) is now of the same order as $N$.

In §6, we consider an important variant of the $\ell _2$ case where we ask for $\ell _2$ instance optimality in the sense of probability. Here, rather than requiring that (Equation 1.11) holds for all $x\in \mathbb{R}^N$, we ask only that for each given $x$ it holds with high probability. We shall see that in the case $X=\ell _2$ the minimal number of measurements $n$ for such results is dramatically reduced, down to the order given by condition (Equation 1.9). Moreover, we show that standard constructions of random matrices such as Gaussian and Bernoulli ensembles achieve this performance.

The striking contrast between the results of §5 and §6 shows that the probabilistic setting plays a crucial role in $\ell _2$ instance optimality. Similar results in the sense of probability have been obtained earlier in a series of papers Reference 7, Reference 10, Reference 11, Reference 12 that reflect the theoretical computer science approach to compressed sensing, also known as data sketching. A comparison with our results is in order.

First, the instance optimality bounds obtained in these papers are quantitatively more precise than ours, since they have the general form

$$\begin{equation} \|x-\Delta (\Phi x)\|_{\ell _2} \leq (1+\epsilon ) \sigma _k(x)_{\ell _2}, \cssId{insteps}{\tag{1.14}} \end{equation}$$

where $\epsilon >0$ can be made arbitrarily small, at the expense of raising $n$, while in most of our results the constant $C_0$ in (Equation 1.11) cannot get arbitrarily close to $1$. On the other hand, for a fixed $\epsilon >0$, the ratio between $n$ and $k$ is generally not as good as in (Equation 1.9): for instance the decoders proposed in Reference 7 and Reference 11, respectively, use $n\sim \frac{k}{\epsilon }\log (N)^{5/2}$ and $n\sim \frac{k}{\epsilon ^3} \log (N)$ samples in order to achieve (Equation 1.14).

Secondly, the types of encoding matrices which are proposed in these papers are of fairly different nature than those which are considered in §6, and our analysis actually does not apply to these matrices. Let us mention that one specific interest of the Gaussian matrices which are considered in the present paper is that they give rise to an encoding which is “robust” with respect to a change of the basis in which the signal is sparse, since the product of such a $\Phi$ and any $N\times N$ unitary matrix $U$ results in a matrix $\tilde{\Phi }$ with the same probability law.

Finally, one of the significant achievements in Reference 7, Reference 10, Reference 11, Reference 12 is the derivation of practical decoding algorithms of polynomial complexity in $k$ up to logarithmic factors, therefore typically faster than solving the $\ell _1$ minimization problem, while we do not propose any such algorithm in the present paper.

Generally speaking, an important issue in compressed sensing is the practical implementation of the decoder $\Delta$ by a fast algorithm. While being aware of this fact, the main goal of the present paper is to understand the theoretical limits of compressed sensing in comparison to nonlinear approximation. Therefore the main question that we address is, “How many measurements do we need so that some decoder recovers $x$ up to some prescribed tolerance?”, rather than, “What is the fastest algorithm that allows us to recover $x$ from these measurements up to the same tolerance?”

The last sections of the paper are devoted to additional results which complete the theory. In order to limit the size of the paper, we only give a sketch of the proofs in those sections. The case $X=\ell _p$ for $1<p<2$ is treated in §7, and in §8 we discuss another type of estimate that we refer to as mixed-norm instance optimality. Here the estimate (Equation 1.11) is replaced by an estimate of the type

$$\begin{equation} \|x-\Delta (\Phi x)\|_{X} \leq C_0 k^{-s}\sigma _k(x)_Y, \cssId{mixinst}{\tag{1.15}} \end{equation}$$

where $Y$ differs from $X$ and $s>0$ is some relevant exponent. This type of estimate was introduced in Reference 4 in the particular case $X=\ell _2$ and $Y=\ell _1$. We give examples in the case $X=\ell _p$ and $Y=\ell _q$ in which mixed-norm estimates allow us to recover better approximation estimates for compressed sensing than (Equation 1.11).

2. Performance over classes

We begin by recalling some well-known results concerning best $k$-term approximation which we shall use in the course of this paper. Given a sequence norm $\|\cdot \|_X$ on $\mathbb{R}^N$ and a positive real number $r>0$, we define the approximation class ${\mathcal{A}}^r$ by means of

$$\begin{equation} \|x\|_{{\mathcal{A}}^r(X)}:= \max _{1\le k\le N} k^r\sigma _k(x)_X. \cssId{anorm}{\tag{2.1}} \end{equation}$$

Notice that since we are in a finite dimensional space $\mathbb{R}^N$, this (quasi-)norm will be finite for all $x\in \mathbb{R}^N$.

A simple, yet fundamental, chapter in $k$-term approximation is to connect the approximation norm in (Equation 2.1) with traditional sequence norms. For this, we define for any $0<q<\infty$, the weak $\ell _q$ norm as

$$\begin{equation} \|x\|_{w\ell _q}^q:= \sup _{\epsilon >0} \epsilon ^{q}\#\{i\; ;\; |x_i|>\epsilon \}. \cssId{weaknorm}{\tag{2.2}} \end{equation}$$

Again, for any $x\in \mathbb{R}^N$ all of these norms are finite.
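For a finite vector, the supremum in (Equation 2.2) can be evaluated from the decreasing rearrangement $x^*_1\ge x^*_2\ge \dots \ge x^*_N$ of the magnitudes $|x_i|$, since it equals $\max _k k\,(x^*_k)^q$. The sketch below uses this identity; the helper names `weak_lq_norm` and `lq_norm` are assumptions made for illustration.

```python
def weak_lq_norm(x, q):
    """Weak l_q (quasi-)norm of a nonempty finite vector, 0 < q < inf.

    Uses sup_eps eps^q * #{i : |x_i| > eps} = max_k k * (x*_k)^q,
    where x* is the decreasing rearrangement of |x|.
    """
    decreasing = sorted((abs(v) for v in x), reverse=True)
    return max((k ** (1.0 / q)) * v for k, v in enumerate(decreasing, start=1))

def lq_norm(x, q):
    """Ordinary l_q (quasi-)norm, for comparison."""
    return sum(abs(v) ** q for v in x) ** (1.0 / q)

x = [4.0, -2.0, 1.0, 1.0]
print(weak_lq_norm(x, 1), lq_norm(x, 1))
```

On any input the first printed value should not exceed the second, in line with the fact that the $\ell _q$ norm dominates the weak $\ell _q$ norm.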

If we fix the $\ell _p$ norm in which approximation error is to be measured, then for any $x\in \mathbb{R}^N$, we have for $q:=(r+1/p)^{-1}$,

$$\begin{equation} B_0\|x\|_{w\ell _q}\le \|x\|_{{\mathcal{A}}^r}\le B_1r^{-1/p}\|x\|_{w\ell _q},\quad x\in \mathbb{R}^N, \cssId{compare1}{\tag{2.3}} \end{equation}$$

for two absolute constants $B_0,B_1>0$. Notice that the constants in these inequalities do not depend on $N$. Therefore, $x\in {\mathcal{A}}^r$ is equivalent to $x\in w\ell _q$ with equivalent norms.

Since the $\ell _q$ norm is larger than the weak $\ell _q$ norm, we can replace the weak $\ell _q$ norm by the $\ell _q$ norm in the right inequality of (Equation 2.3). However, the constant can be improved via a direct argument. Namely, if $1/q=r+1/p$, then for any $x\in \mathbb{R}^N$,

$$\begin{equation} \sigma _k(x)_{\ell _p} \le \|x\|_{\ell _q}k^{-r},\quad k=1,2,\dots , N. \cssId{compare2}{\tag{2.4}} \end{equation}$$

To prove this, take $\Lambda$ as the set of indices corresponding to the $k$ largest entries in $x$. If $\epsilon$ is the size of the smallest entry in $\Lambda$, then $\epsilon \le \|x\|_{w\ell _q}k^{-1/q}\le \|x\|_{\ell _q}k^{-1/q}$ and therefore

$$\begin{equation} \sigma _k(x)_{\ell _p}^p = \sum _{i\notin \Lambda }|x_i|^{p} \le \epsilon ^{p-q} \sum _{i\notin \Lambda }|x_i|^{q}\le k^{-\frac{p-q}{q}}\|x\|_{\ell _q}^{p-q}\|x\|_{\ell _q}^{q}, \cssId{compare3}{\tag{2.5}} \end{equation}$$

so that (Equation 2.4) follows.

From this, we see that if we consider the class $K=U(\ell _q^N)$, we have

$$\begin{equation} \sigma _k(K)_{\ell _p} \leq k^{-r}, \cssId{sigmakK}{\tag{2.6}} \end{equation}$$

with $r=1/q-1/p$. On the other hand, taking $x\in K$ such that $x_i=(2k)^{-1/q}$ for $2k$ indices and $0$ otherwise, we find that

$$\begin{equation} \sigma _k(x)_{\ell _p} = [k(2k)^{-p/q}]^{1/p}=2^{-1/q}k^{-r}, \tag{2.7} \end{equation}$$

so that $\sigma _k(K)_X$ can be framed by

$$\begin{equation} 2^{-1/q}k^{-r} \leq \sigma _k(K)_{\ell _p} \leq k^{-r}. \cssId{sigmakK2}{\tag{2.8}} \end{equation}$$
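As a numerical sanity check of (Equation 2.4) and (Equation 2.8), the following sketch evaluates $\sigma _k$ on the extremal vector used above and on an arbitrary test vector. The helper names and the sample values ($p=2$, $q=1$, $k=3$, $N=10$) are illustrative choices, not taken from the paper.

```python
def lp_norm(x, p):
    """l_p (quasi-)norm for 0 < p < inf."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def sigma_k(x, k, p):
    """l_p norm of x after discarding its k largest-magnitude entries."""
    tail = sorted((abs(v) for v in x), reverse=True)[k:]
    return sum(t ** p for t in tail) ** (1.0 / p)

p, q, k, N = 2.0, 1.0, 3, 10
r = 1.0 / q - 1.0 / p

# Extremal vector from the text: 2k entries equal to (2k)^(-1/q), so ||x||_q = 1.
extremal = [(2 * k) ** (-1.0 / q)] * (2 * k) + [0.0] * (N - 2 * k)
lower = 2 ** (-1.0 / q) * k ** (-r)
print(sigma_k(extremal, k, p), lower)  # the two sides of (2.7)

# Arbitrary vector: (2.4) asserts sigma_k(x)_p <= ||x||_q * k^(-r).
x = [5.0, -3.0, 2.0, 1.5, -1.0, 0.5, 0.25, 0.0, 0.0, 0.1]
print(sigma_k(x, k, p) <= lp_norm(x, q) * k ** (-r))
```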

We next turn to the performance of compressed sensing over classes of vectors, by studying the quantity $E_n(K)_X$ defined by (Equation 1.6). As we have mentioned, the optimal performance of sensing algorithms is closely connected to the concept of Gelfand widths which are in some sense dual to the perhaps better known Kolmogorov widths. If $K$ is a compact set in $X$ and $n$ is a positive integer, then the Gelfand width of $K$ and of order $n$ is by definition

$$\begin{equation} d^n(K)_X := \inf _{Y}\sup \{\|x\|_X\; ;\; x\in K\cap Y\} \cssId{gelwidth}{\tag{2.9}} \end{equation}$$

where the infimum is taken over all subspaces $Y$ of $X$ of codimension less than or equal to $n$. This quantity is equivalent to $E_n(K)_X$, according to the following well-known result.

Lemma 2.1.

Let $K\subset \mathbb{R}^N$ be any set for which $K=-K$ and for which there is a $C_0>0$ such that $K+K\subset C_0K$. If $X\subset \mathbb{R}^N$ is any normed space, then

$$\begin{equation} d^n(K)_X\le E_{n}(K)_X\le C_0d^n(K)_X,\quad 1\le n\le N. \cssId{comparegel}{\tag{2.10}} \end{equation}$$

Proof.

We give a proof for completeness of this paper. We first remark that the null space $Y={\mathcal{N}}$ of $\Phi$ is of codimension less than or equal to $n$. Conversely, given any space $Y\subset \mathbb{R}^N$ of codimension $n$, we can associate its orthogonal complement $Y^\perp$ which is of dimension $n$ and the $n\times N$ matrix $\Phi$ whose rows are formed by any basis for $Y^\perp$. Through this identification, we see that

$$\begin{equation} d^n(K)_X=\inf _{\Phi } \sup \{\|\eta \|_X:\eta \in {\mathcal{N}}\cap K\}, \cssId{comp1}{\tag{2.11}} \end{equation}$$

where the infimum is taken over all $n\times N$ matrices $\Phi$.

Now, if ($\Phi ,\Delta$) is any encoder-decoder pair and $z=\Delta (0)$, then for any $\eta \in {\mathcal{N}}$, we also have $-\eta \in {\mathcal{N}}$. It follows that either $\|\eta -z\|_X\ge \|\eta \|_X$ or $\|-\eta -z\|_X\ge \|\eta \|_X$. Since $K=-K$, we conclude that

$$\begin{equation} d^n(K)_X\le \sup _{\eta \in {\mathcal{N}}\cap K}\|\eta -\Delta (\Phi \eta )\|_X. \cssId{comp2}{\tag{2.12}} \end{equation}$$

Taking an infimum over all encoder-decoder pairs in ${\mathcal{A}}_{n,N}$, we obtain the left inequality in (Equation 2.10).

To prove the right inequality, we choose an optimal $Y$ for $d^n(K)_X$ and use the matrix $\Phi$ associated to $Y$ (i.e., the rows of $\Phi$ are a basis for $Y^\perp$). We define a decoder $\Delta$ for $\Phi$ as follows. Given $y$ in the range of $\Phi$, we recall that ${\mathcal{F}}(y)$ is the set of $x$ such that $\Phi x=y$. If ${\mathcal{F}}(y)\cap K\neq \emptyset$, we take any $\bar{x}(y)\in {\mathcal{F}}(y)\cap K$ and define $\Delta (y):=\bar{x}(y)$. When ${\mathcal{F}}(y)\cap K=\emptyset$, we define $\Delta (y)$ as any element from ${\mathcal{F}}(y)$. This gives

$$\begin{equation} E_n(K)_X \le \sup _{x,x'\in {\mathcal{F}}(y)\cap K}\|x-x'\|_X\le \sup _{\eta \in C_0[K\cap {\mathcal{N}}]}\|\eta \|_X\le C_0d^n(K)_X, \cssId{comp3}{\tag{2.13}} \end{equation}$$

where we have used the fact that $x-x'\in {\mathcal{N}}$ and $x-x'\in C_0K$ by our assumptions on $K$. This proves the right inequality in (Equation 2.10).

■

The orders of the Gelfand widths of $\ell _q$ balls in $\ell _p$ are known except perhaps for the case $q=1, p=\infty$. For the range of $p,q$ that is relevant here even the constants are known. We recall the following results of Gluskin, Garnaev and Kashin which can be found in Reference 13, Reference 9, Reference 14; see also Reference 15. For $K=U(\ell _q^N)$, we have

$$\begin{equation} C_1\Psi (n,N,q,p)\le d^n(K)_{\ell _p}\le C_2\Psi (n,N,q,p), \cssId{gelwidths}{\tag{2.14}} \end{equation}$$

where $C_1,C_2$ only depend on $p$ and $q$ and where

$$\begin{equation} \Psi (n,N,q,p):= [\min (1, N^{1-1/q}n^{-1/2})]^{\frac{1/q-1/p}{1/q-1/2}}, \quad 1\le n\le N, \ 1< q< p\le 2, \cssId{psi}{\tag{2.15}} \end{equation}$$

and

$$\begin{equation} \Psi (n,N,1,2):= \min \,\left\{1, \sqrt {\frac{\log (N/n)}{n}}\right\}. \cssId{psi1}{\tag{2.16}} \end{equation}$$

Since $K=U(\ell _q^N)$ obviously satisfies the assumptions of Lemma 2.1 with $C_0=2$, we also have

$$\begin{equation} C_1\Psi (n,N,q,p)\le E_n(K)_{\ell _p}\le 2C_2 \Psi (n,N,q,p). \cssId{gelwidths2}{\tag{2.17}} \end{equation}$$

From (Equation 2.14), (Equation 2.16), (Equation 2.10) we deduce indeed the announced fact that $E_n(U(\ell _1^N))_{\ell _2} \leq C_0\sigma _k(U(\ell _1^N))_{\ell _2}$ can only hold when $k$ and the necessary number of measurements $n$ are interrelated by (Equation 1.9). The possible range of $k$ for which even instance optimality could hold is therefore also limited by (Equation 1.9), a relation that will turn up frequently in what follows.
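To make the link with (Equation 1.9) explicit, here is a short worked computation. It assumes $1\le n<N$ and $\log (N/n)\le n$, so that the minimum in (Equation 2.16) equals $\sqrt {\log (N/n)/n}$, and it writes $C_1,C_2$ for the constants of (Equation 2.14) with $q=1$, $p=2$. For $K=U(\ell _1^N)$ and $X=\ell _2$ we have $r=1/2$, so (Equation 2.8) reads $\frac{1}{2}k^{-1/2}\le \sigma _k(K)_{\ell _2}\le k^{-1/2}$. Combining the lower bound in (Equation 2.17) with this upper bound gives

$$E_n(K)_{\ell _2}\le C_0\sigma _k(K)_{\ell _2}\ \Longrightarrow \ C_1\sqrt {\frac{\log (N/n)}{n}}\le C_0k^{-1/2}\ \Longleftrightarrow \ k\le \left(\frac{C_0}{C_1}\right)^2\frac{n}{\log (N/n)},$$

while the upper bound in (Equation 2.17) and the lower bound in (Equation 2.8) give, in the other direction,

$$k\le \left(\frac{C_0}{4C_2}\right)^2\frac{n}{\log (N/n)}\ \Longrightarrow \ E_n(K)_{\ell _2}\le 2C_2\sqrt {\frac{\log (N/n)}{n}}\le \frac{C_0}{2}k^{-1/2}\le C_0\sigma _k(K)_{\ell _2}.$$

Both conditions on $k$ have the form (Equation 1.9), with different values of the constant $c_0$.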

3. Instance optimality and the null space of $\Phi$

We now turn to the main question addressed in this paper, namely the study of instance optimality as expressed by (Equation 1.11). In this section, we shall see that (Equation 1.11) can be reformulated as a property of the null space ${\mathcal{N}}$ of $\Phi$. As was already remarked in the proof of Lemma 2.1, this null space has codimension not larger than $n$.

We shall also need to consider sections of $\Phi$ obtained by keeping some of its columns: for $T\subset \{1,\dots ,N\}$, we denote by $\Phi _T$ the $n\times \#T$ matrix formed from the columns of $\Phi$ with indices in $T$. Similarly we shall have to deal with restrictions $x_T$ of vectors $x\in \mathbb{R}^N$ to sets $T$. However, it will be convenient to view such restrictions still as elements of $\mathbb{R}^N$, i.e., $x_T$ agrees with $x$ on $T$ and has all components equal to zero whose indices do not belong to $T$.

We begin by studying under what circumstances the measurement vector $y=\Phi x$ uniquely determines each $k$-sparse vector $x\in \Sigma _k$. This is expressed by the following trivial lemma.

Lemma 3.1.

If $\Phi$ is any $n\times N$ matrix and $2k\le n$, then the following are equivalent:

(i) There is a decoder $\Delta$ such that $\Delta (\Phi x)=x$, for all $x\in \Sigma _k$.

(ii) $\Sigma _{2k}\cap {\mathcal{N}}=\{0\}$.

(iii) For any set $T$ with $\#T=2k$, the matrix $\Phi _T$ has rank $2k$.

(iv) The symmetric nonnegative matrix $\Phi _T^t\Phi _T$ is invertible, i.e., positive definite.

Proof.

The equivalence of (ii), (iii), (iv) is linear algebra.

(i)$\Rightarrow$(ii): Suppose (i) holds and $x\in \Sigma _{2k}\cap {\mathcal{N}}$. We can write $x=x_0-x_1$ where both $x_0,x_1\in \Sigma _k$. Since $\Phi x_0=\Phi x_1$, we have, by (i), that $x_0=x_1$ and hence $x=x_0-x_1=0$.

(ii)$\Rightarrow$(i): Given any $y\in \mathbb{R}^n$, we define $\Delta (y)$ to be any element in ${\mathcal{F}}(y)$ with smallest support. Now, if $x_1,x_2\in \Sigma _k$ with $\Phi x_1=\Phi x_2$, then $x_1-x_2\in {\mathcal{N}}\cap \Sigma _{2k}$. From (ii), this means that $x_1=x_2$. Hence, if $x\in \Sigma _k$, then $\Delta (\Phi x)=x$ as desired.

■

The properties discussed in Lemma 3.1 are algebraic properties of $\Phi$. If $N,k$ are fixed, the question arises as to how large we need to make $n$ so that there is a matrix $\Phi$ having the properties of the lemma. It is easy to see that we can take $n=2k$. Indeed, for any $k$ and $N\ge 2k$, we can find a set $\Lambda _N$ of $N$ vectors in $\mathbb{R}^{2k}$ such that any $2k$ of them are linearly independent. For example if $0<x_1<x_2<\cdots <x_N$, then the matrix whose $(i,j)$ entry is $x_j^{i-1}$ has the properties of Lemma 3.1. Its $2k\times 2k$ minors are Vandermonde matrices which are well known to be nonsingular. Unfortunately, such matrices are poorly conditioned when $N$ is large and the process of recovering $x\in \Sigma _k$ from $y=\Phi x$ is therefore numerically unstable.
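The construction can be tried on a toy example. The sketch below builds the Vandermonde encoder with $n=2k$ rows and decodes by exhaustive search over supports, which is the smallest-support decoder from the proof of Lemma 3.1 restricted to supports of size at most $k$. It uses exact rational arithmetic to sidestep the conditioning issue just mentioned, and its cost grows combinatorially in $N$, so it is an illustration only; all function names and the sample nodes are our own choices.

```python
from fractions import Fraction
from itertools import combinations

def vandermonde_encoder(nodes, n):
    """n x N matrix whose (i, j) entry is nodes[j]**i for i = 0, ..., n-1."""
    return [[Fraction(t) ** i for t in nodes] for i in range(n)]

def encode(phi, x):
    """Measurement vector y = Phi x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in phi]

def solve_columns(phi, support, y):
    """Solve Phi_T c = y exactly by Gauss-Jordan elimination.

    Returns the coefficients c (free variables set to 0), or None if the
    system is inconsistent.
    """
    m = len(support)
    rows = [[phi[i][j] for j in support] + [Fraction(y[i])] for i in range(len(phi))]
    pivots, r = [], 0
    for c in range(m):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [v / rows[r][c] for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    if any(row[m] != 0 for row in rows[r:]):
        return None  # y is not in the span of the selected columns
    coeffs = [Fraction(0)] * m
    for i, c in enumerate(pivots):
        coeffs[c] = rows[i][m]
    return coeffs

def decode_sparsest(phi, y, k):
    """Return a vector with Phi z = y of smallest support size <= k, or None."""
    N = len(phi[0])
    for size in range(k + 1):
        for support in combinations(range(N), size):
            c = solve_columns(phi, support, y)
            if c is not None:
                z = [Fraction(0)] * N
                for j, cj in zip(support, c):
                    z[j] = cj
                return z
    return None

k = 2
phi = vandermonde_encoder([1, 2, 3, 4, 5, 6], 2 * k)  # n = 2k = 4, N = 6
x = [0, 3, 0, 0, -2, 0]                               # a 2-sparse signal
print(decode_sparsest(phi, encode(phi, x), k) == x)
```

Since any $2k$ columns of this $\Phi$ are linearly independent, the search can only succeed on a support containing the true one, which is what property (ii) of Lemma 3.1 expresses.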

Stable recovery procedures have been proposed by Candès, Romberg, and Tao and by Donoho under stronger conditions on $\Phi$. We shall make heavy use in this paper of the following property introduced by Candès and Tao. We say that $\Phi$ satisfies the restricted isometry property (RIP) of order $k$ if there is a $0<\delta _k<1$ such that

$$\begin{equation} (1-\delta _k) \|z\|_{\ell _2}\le \|\Phi _T z\|_{\ell _2}\le (1+\delta _k)\|z\|_{\ell _2},\quad z\in \mathbb{R}^k, \cssId{cs1}{\tag{3.1}} \end{equation}$$

holds for all $T$ of cardinality $k$.⁠Footnote¹ The RIP condition is equivalent to saying that the symmetric matrix $\Phi _T^t\Phi _T$ is positive definite with eigenvalues in $[(1-\delta _k)^2,(1+\delta _k)^2]$. Note that RIP of order $k$ always implies RIP of order $l\leq k$. Note also that RIP of order $2k$ guarantees that the properties of Lemma 3.1 hold.

Footnote 1. In all that follows, the RIP condition could be replaced by the assumption that $C_0\|z\|_{\ell _2}\le \|\Phi _T z\|_{\ell _2}\le C_1\|z\|_{\ell _2}$ holds for all $T$ with $\#T=k$, with absolute constants $C_0,C_1$. However, this latter condition is equivalent to having a rescaled matrix $\alpha \Phi$ satisfy RIP for some $\alpha$, and the rescaled matrix extracts exactly the same information from a vector $x$ as $\Phi$ does.
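For very small matrices the constant in (Equation 3.1) can be computed by brute force. The sketch below treats only order $k=2$, where the eigenvalues of the $2\times 2$ matrix $\Phi _T^t\Phi _T$ have a closed form, and returns the smallest $\delta \ge 0$ for which (Equation 3.1) holds over all $T$ with $\#T=2$. The function name is hypothetical, and a returned value of $1$ or more means that no admissible $\delta _2<1$ exists for that matrix.

```python
import math
from itertools import combinations

def rip_constant_order_2(phi):
    """Smallest delta >= 0 such that (3.1) holds for every T with #T = 2."""
    n, N = len(phi), len(phi[0])
    cols = [[phi[i][j] for i in range(n)] for j in range(N)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    delta = 0.0
    for i, j in combinations(range(N), 2):
        # Gram matrix [[a, b], [b, d]] of the two selected columns.
        a, d, b = dot(cols[i], cols[i]), dot(cols[j], cols[j]), dot(cols[i], cols[j])
        mid, rad = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
        # Singular values of Phi_T are square roots of the Gram eigenvalues.
        s_max = math.sqrt(mid + rad)
        s_min = math.sqrt(max(mid - rad, 0.0))
        delta = max(delta, 1.0 - s_min, s_max - 1.0)
    return delta

phi = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
print(rip_constant_order_2(phi))
```

The number of subsets $T$ grows combinatorially with $N$ and $k$, so exhaustive checks of this kind are limited to toy sizes.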

Candès and Tao have shown that any matrix $\Phi$ which satisfies the RIP property for $k$ and sufficiently small $\delta _k$ will extract enough information about $x$ to approximate it well and moreover the decoding can be done by $\ell _1$ minimization. The key question then is, given a fixed $n,N$, how large can we take $k$ and still have matrices which satisfy RIP for $k$? It was shown by Candès and Tao Reference 5, as well as Donoho Reference 8, that certain families of random matrices will, with high probability, satisfy RIP of order $k$ with $\delta _k \leq \delta <1$ for some prescribed $\delta$ independent of $N$ provided $k\le c_0n/\log (N/k)$. Here $c_0$ is a constant which when made small will make $\delta _k$ small as well. It should be stressed that all available constructions of such matrices (so far) involve random variables. For instance, as we shall recall in more detail in §6, the entries of $\Phi$ can be picked as i.i.d. Gaussian or Bernoulli variables with proper normalization.

We turn to the question of whether $y$ contains enough information to approximate $x$ to accuracy $\sigma _k(x)$ as expressed by (Equation 1.11). The following theorem shows that this can be understood through the study of the null space ${\mathcal{N}}$ of $\Phi$.

Theorem 3.2.

Given an $n\times N$ matrix $\Phi$, a norm $\|\cdot \|_X$ and a value of $k$, then a sufficient condition that there exists a decoder $\Delta$ such that (Equation 1.11) holds with constant $C_0$ is that

$$\begin{equation} \|\eta \|_X\le \frac{C_0}{2}\sigma _{2k}(\eta )_X,\quad \eta \in {\mathcal{N}}. \cssId{null}{\tag{3.2}} \end{equation}$$

A necessary condition is that

$$\begin{equation} \|\eta \|_X\le C_0\sigma _{2k}(\eta )_X,\quad \eta \in {\mathcal{N}}. \cssId{null1}{\tag{3.3}} \end{equation}$$

Proof.

To prove the sufficiency of (Equation 3.2), we will define a decoder $\Delta$ for $\Phi$ as follows. Given any $y\in \mathbb{R}^n$, we consider the set ${\mathcal{F}}(y)$ and choose

$$\begin{equation} \Delta (y):=\operatorname *{Argmin}_{z\in {\mathcal{F}}(y)}\sigma _k(z)_X. \cssId{ta}{\tag{3.4}} \end{equation}$$

We shall prove that for all $x\in \mathbb{R}^N$

$$\begin{equation} \|x-\Delta (\Phi x)\|_X\le C_0\sigma _k(x)_X. \cssId{prove}{\tag{3.5}} \end{equation}$$

Indeed, $\eta :=x-\Delta (\Phi x)$ is in ${\mathcal{N}}$ and hence by (Equation 3.2), we have

$$\begin{array}{lll} \|x-\Delta (\Phi x)\|_X &\le (C_0/2)\sigma _{2k}(x-\Delta (\Phi x))_X\\&\le (C_0/2)(\sigma _k(x)_X+\sigma _k(\Delta (\Phi x))_X)\\&\le C_0\sigma _k(x)_X, \end{array}$$

where the second inequality uses the fact that $\sigma _{2k}(x+z)_X \leq \sigma _k(x)_X+\sigma _k(z)_X$ and the last inequality uses the fact that $\Delta (\Phi x)$ minimizes $\sigma _k(z)$ over ${\mathcal{F}}(y)$.
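The subadditivity fact used in the second inequality has a short justification, recorded here for convenience (the notation $x_k$, $z_k$ for near-best approximants is ours). Given $\epsilon >0$, choose $x_k,z_k\in \Sigma _k$ with $\|x-x_k\|_X\le \sigma _k(x)_X+\epsilon$ and $\|z-z_k\|_X\le \sigma _k(z)_X+\epsilon$. Since $x_k+z_k\in \Sigma _{2k}$,

$$\sigma _{2k}(x+z)_X\le \|(x+z)-(x_k+z_k)\|_X\le \|x-x_k\|_X+\|z-z_k\|_X\le \sigma _k(x)_X+\sigma _k(z)_X+2\epsilon ,$$

and letting $\epsilon \to 0$ gives the claim. In the proof it is applied with $z=-\Delta (\Phi x)$, using $\sigma _k(-z)_X=\sigma _k(z)_X$.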

To prove the necessity of (Equation 3.3), let $\Delta$ be any decoder for which (Equation 1.11) holds. Let $\eta$ be any element in ${\mathcal{N}}={\mathcal{N}}(\Phi )$ and let $\eta _0$ be the best $2k$-term approximation of $\eta$ in $X$. Letting $\eta _0=\eta _1+\eta _2$ be any splitting of $\eta _0$ into two vectors of support size $k$, we can write

$$\begin{equation} \eta =\eta _1+\eta _2+\eta _3, \tag{3.6} \end{equation}$$

with $\eta _3=\eta -\eta _0$. Since $-\eta _1\in \Sigma _k$, we have by (Equation 1.11) that $-\eta _1=\Delta (\Phi (-\eta _1))$, but since $\eta \in {\mathcal{N}}$, we also have $-\Phi \eta _1=\Phi (\eta _2+\eta _3)$ so that $-\eta _1 =\Delta (\Phi (\eta _2 +\eta _3))$. Using again (Equation 1.11), we derive

$$\begin{eqnarray*} \|\eta \|_X & = &\|\eta _2+\eta _3 -\Delta (\Phi (\eta _2 +\eta _3))\|_X \leq C_0\sigma _k(\eta _2+\eta _3)_X\\ & \leq & C_0\|\eta _3\|_X = C_0\sigma _{2k}(\eta )_X, \end{eqnarray*}$$

which is (Equation 3.3).

■

When $X$ is an $\ell _p$ space, the best $k$-term approximation is obtained by leaving the $k$ largest components of $x$ unchanged and setting all the others to $0$. Therefore the property

$$\begin{equation} \|\eta \|_X \leq C\sigma _k(\eta )_X \tag{3.7} \end{equation}$$

can be reformulated by saying that

$$\begin{equation} \|\eta \|_X \leq C\|\eta _{T^c}\|_X \cssId{NSPX}{\tag{3.8}} \end{equation}$$

holds for all $T\subset \{1,\ldots ,N\}$ such that $\#T\leq k$, where $T^c$ is the complement set of $T$ in $\{1,\ldots ,N\}$. In going further, we shall say that $\Phi$ has the null space property in $X$ of order $k$ with constant $C$ if (Equation 3.8) holds for all $\eta \in {\mathcal{N}}$ and $\#T\leq k$. Thus, we have

Corollary 3.3.

Suppose that $X$ is an $\ell _p^N$ space, $k>0$ an integer and $\Phi$ an encoding matrix. If $\Phi$ has the null space property (Equation 3.8) in $X$ of order $2k$ with constant $C_0/2$, then there exists a decoder $\Delta$ so that $(\Phi ,\Delta )$ satisfies (Equation 1.11) with constant $C_0$. Conversely, the validity of (Equation 1.11) for some decoder $\Delta$ implies that $\Phi$ has the null space property (Equation 3.8) in $X$ of order $2k$ with constant $C_0$.
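When the null space is one-dimensional, say ${\mathcal{N}}=\operatorname {span}\{\eta \}$, the smallest constant in (Equation 3.8) can be computed directly, because (Equation 3.8) is invariant under scaling of $\eta$ and the worst set $T$ consists of the indices of the $k$ largest entries. The sketch below does this for $X=\ell _p$ with $0<p<\infty$ and a nonzero $\eta$; the function name and test vector are illustrative assumptions. For a higher-dimensional null space one would have to take a supremum over all $\eta \in {\mathcal{N}}$, which this sketch does not attempt.

```python
def nsp_constant(eta, k, p):
    """Smallest C with ||eta||_p <= C * ||eta_{T^c}||_p for all #T <= k.

    Assumes eta is nonzero. Returns float('inf') when eta has at most k
    nonzero entries, since then no finite constant works.
    """
    mags = sorted((abs(v) for v in eta), reverse=True)
    full = sum(m ** p for m in mags) ** (1.0 / p)
    # Worst case: T removes the k largest entries, leaving the tail.
    tail = sum(m ** p for m in mags[k:]) ** (1.0 / p)
    if tail == 0.0:
        return float("inf")
    return full / tail

print(nsp_constant([1.0, 1.0, 1.0, 1.0], 2, 1))
```

The infinite value for a $k$-sparse $\eta$ reflects the fact that a nonzero $k$-sparse vector in ${\mathcal{N}}$ rules out the null space property of order $k$.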

In the next two sections, we shall use this corollary in order to study instance optimality in the case where the $X$ norm is $\ell _1$ and $\ell _2$, respectively.
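The reformulation of (Equation 3.7) as (Equation 3.8) can be checked numerically on a single vector. The following Python sketch is only an illustration under the assumption that $X=\ell _p$ with $1\le p<\infty$; the function names `lp_norm`, `best_k_term`, `sigma_k` and `nsp_ratio` are hypothetical and do not come from the paper.

```python
def lp_norm(v, p):
    """l_p norm of a finite sequence, for 1 <= p < infinity."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def best_k_term(x, k):
    """Keep the k entries of x that are largest in absolute value, zero the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    kept = set(order[:k])
    return [x[i] if i in kept else 0.0 for i in range(len(x))]


def sigma_k(x, k, p):
    """Error of the best k-term approximation of x in l_p."""
    xk = best_k_term(x, k)
    return lp_norm([a - b for a, b in zip(x, xk)], p)


def nsp_ratio(eta, k, p):
    """Smallest C for which (3.8) holds for this eta and every T with #T <= k.

    The worst set T is the set of the k largest entries, so the ratio is
    ||eta||_p / sigma_k(eta)_p.
    """
    total = lp_norm(eta, p)
    tail = sigma_k(eta, k, p)
    if tail == 0:
        return 0.0 if total == 0 else float("inf")
    return total / tail
```

For instance, for $\eta =(3,-4,0,1,2)$, $k=2$ and $p=1$, the tail is $\|(0,0,0,1,2)\|_{\ell _1}=3$ and the ratio is $10/3$. A matrix $\Phi$ has the null space property of order $k$ with constant $C$ exactly when this ratio is at most $C$ for every $\eta$ in its null space.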

4. The case $X=\ell _1$

In this section, we shall study the null space property (Equation 3.8) in the case where $X=\ell _1$. We shall make use of the restricted isometry property (Equation 3.1) introduced by Candès and Tao. We begin with the following lemma whose proof is inspired by results in Reference 4.

Lemma 4.1.

Let $a=\ell /k$, $b=\ell '/k$ with $\ell ,\ell '\geq k$ integers. If $\Phi$ is any matrix which satisfies the RIP of order $(a+b)k$ with $\delta =\delta _{(a+b)k} <1$, then $\Phi$ satisfies the null space property in $\ell _1$ of order $ak$ with constant $C_0= 1+\frac{\sqrt {a}(1+\delta )}{\sqrt {b}(1-\delta )}$.

Proof.

It is enough to prove (Equation 3.8) in the case when $T$ is the set of indices of the largest $ak$ entries of $\eta$. Let $T_0=T$, $T_1$ denote the set of indices of the next $bk$ largest entries of $\eta$, $T_2$ the next $bk$ largest, and so on. The last set $T_s$ defined this way may have fewer than $bk$ elements.

We define $\eta _0:=\eta _{T_0} +\eta _{T_1}$. Since $\eta \in {\mathcal{N}}$, we have $\Phi \eta _0=-\Phi (\eta _{T_2}+\dots +\eta _{T_s})$, so that

$$\begin{eqnarray*} \|\eta _{T}\|_{\ell _2} & \le &\|\eta _0\|_{\ell _2} \leq (1-\delta )^{-1} \|\Phi \eta _0\|_{\ell _2} =(1-\delta )^{-1} \|\Phi (\eta _{T_2}+\dots +\eta _{T_s})\|_{\ell _2}\\ & \le & (1-\delta )^{-1} \sum _{j=2}^s\|\Phi \eta _{T_j}\|_{\ell _2} \le (1+\delta ) (1-\delta )^{-1} \sum _{j=2}^s\|\eta _{T_j}\|_{\ell _2} , \end{eqnarray*}$$

where we have used both bounds in (Equation 3.1). Now for any $i\in T_{j+1}$ and $i'\in T_j$, we have $|\eta _i|\leq |\eta _{i'}|$ so that $|\eta _i| \leq (bk)^{-1}\|\eta _{T_j}\|_{\ell _1}$. It follows that

$$\begin{equation} \|\eta _{T_{j+1}}\|_{\ell _2} \leq (bk)^{-1/2}\|\eta _{T_j}\|_{\ell _1},\quad j=1,2,\dots ,s-1, \tag{4.1} \end{equation}$$

so that

$$\begin{equation} \begin{aligned} \|\eta _{T}\|_{\ell _2}& \leq (1+\delta ) (1-\delta )^{-1} (bk)^{-1/2}\sum _{j=1}^{s-1}\|\eta _{T_j}\|_{\ell _1}\\ & \leq (1+\delta ) (1-\delta )^{-1}(bk)^{-1/2} \|\eta _{T^c}\|_{\ell _1}. \end{aligned} \tag{4.2} \end{equation}$$

By the Cauchy-Schwarz inequality $\|\eta _T\|_{\ell _1} \leq (ak)^{1/2}\|\eta _{T}\|_{\ell _2}$, and we therefore obtain

$$\begin{equation} \|\eta \|_{\ell _1}=\|\eta _T\|_{\ell _1}+\|\eta _{T^c}\|_{\ell _1}\le (1+\frac{\sqrt {a}(1+\delta )}{\sqrt {b}(1-\delta )})\|\eta _{T^c}\|_{\ell _1}, \cssId{nsp111}{\tag{4.3}} \end{equation}$$

which verifies the null space property with the constant $C_0$.

■
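The block decomposition $T_0,T_1,\dots ,T_s$ and the inequality (Equation 4.1) used in the proof above can be illustrated on a concrete vector. The sketch below is a minimal check, not part of the original argument; `sorted_blocks` and `check_block_inequality` are hypothetical names, and the arguments `ak` and `bk` stand for the integers $ak$ and $bk$ of Lemma 4.1.

```python
import math


def sorted_blocks(eta, first, size):
    """Index sets T_0, T_1, ..., T_s for eta sorted by decreasing magnitude.

    T_0 has `first` indices, every later block has `size` indices,
    except possibly the last one, which may be shorter.
    """
    order = sorted(range(len(eta)), key=lambda i: abs(eta[i]), reverse=True)
    blocks = [order[:first]]
    for start in range(first, len(order), size):
        blocks.append(order[start:start + size])
    return blocks


def check_block_inequality(eta, ak, bk):
    """True if (4.1) holds for all consecutive blocks T_j, T_{j+1} with j >= 1."""
    blocks = sorted_blocks(eta, ak, bk)
    for j in range(1, len(blocks) - 1):
        l2_next = math.sqrt(sum(eta[i] ** 2 for i in blocks[j + 1]))
        l1_prev = sum(abs(eta[i]) for i in blocks[j])
        # small tolerance for floating point rounding
        if l2_next > l1_prev / math.sqrt(bk) + 1e-12:
            return False
    return True
```

Since (Equation 4.1) only uses the ordering of the entries, the check succeeds for every input vector; the null space assumption on $\eta$ enters only through the RIP step of the proof.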

Combining Corollary 3.3 and Lemma 4.1 (with $a=2$ and $b=1$), we have therefore proved the following.

Theorem 4.2.

Let $\Phi$ be any matrix which satisfies the RIP of order $3k$ with $\delta =\delta _{3k}<1$. Define the decoder $\Delta$ for $\Phi$ as in (Equation 3.4) for $X=\ell _1$. Then (Equation 1.11) holds in $X=\ell _1$ with constant $C_0=2(1+\sqrt {2}\frac{1+\delta }{1-\delta })$. Generally speaking, we cannot derive a constant of the type $1+\epsilon$ from an analysis based on Lemma 4.1, since it requires that the null space property holds with constant $C_0/2$ which is therefore larger than $1$.

As was mentioned in the previous section, one can build matrices $\Phi$ which satisfy the RIP of order $k$ under the condition $n\geq ck\log (N/n)$ where $c$ is some fixed constant. We therefore conclude that instance optimality of order $k$ in the $\ell _1$ norm can be achieved at the price of ${\mathcal{O}}(k\log (N/n))$ measurements.

Remark 4.3.

More generally, if $\Phi$ satisfies the RIP of order $(2+b)k$ and $\Delta$ is defined by (Equation 3.4) for $X=\ell _1$, then (Equation 1.11) holds in $X=\ell _1$ with constant $C_0=2(1+\sqrt {2/b}\frac{1+\delta }{1-\delta })$. Therefore, if we make $b$ large, the constant $C_0$ in (Equation 1.11) is of the type $2+\epsilon$ under a condition of the type $n\geq c\frac{k}{\epsilon ^2}\log (N/n)$.
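The constants in Lemma 4.1, Theorem 4.2 and Remark 4.3 are related by simple arithmetic, which the following Python sketch makes explicit. It is an illustration only; the function names are hypothetical.

```python
import math


def nsp_constant_l1(a, b, delta):
    """Constant of Lemma 4.1: 1 + sqrt(a)(1 + delta) / (sqrt(b)(1 - delta))."""
    if not 0 <= delta < 1:
        raise ValueError("delta must lie in [0, 1)")
    return 1.0 + math.sqrt(a) * (1 + delta) / (math.sqrt(b) * (1 - delta))


def instance_constant_l1(b, delta):
    """C_0 of Remark 4.3: twice the null space constant of order 2k (a = 2).

    The choice b = 1 gives the constant of Theorem 4.2.
    """
    return 2.0 * nsp_constant_l1(2, b, delta)
```

With $\delta =0$, the choice $b=1$ gives $2(1+\sqrt {2})$, while $b=8$ gives $3$; letting $b$ grow drives the constant toward $2$, at the cost of requiring the RIP of the larger order $(2+b)k$.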

Note that on the other hand, since instance optimality of order $k$ in any norm $X$ always implies that the reconstruction is exact when $x\in \Sigma _k$, it cannot be achieved with fewer than $2k$ measurements according to Lemma 3.1.

Before addressing the $\ell _2$ case, let us briefly discuss the decoder $\Delta$ which achieves (Equation 1.11) for such a $\Phi$. According to the proof of Theorem 3.2, one can build $\Delta$ as the solution of the minimization problem (Equation 3.4). It is not clear to us whether this minimization problem can be solved in polynomial time in $N$. The following result shows that it is possible to define $\Delta$ by $\ell _1$ minimization if $\Phi$ satisfies the RIP with some additional control on the constants in (Equation 3.1).

Theorem 4.4.

Let $\Phi$ be any matrix which satisfies the RIP of order $3k$ with $\delta _{3k}\leq \delta < (\sqrt {2}-1)^2/3$. Define the decoder $\Delta$ for $\Phi$ as in (Equation 1.10). Then, $(\Phi ,\Delta )$ satisfies (Equation 1.11) in $X=\ell _1$ with $C_0=\frac{2\sqrt {2}+2-(2\sqrt {2}-2)\delta }{\sqrt {2}-1-(\sqrt {2}+1)\delta }$.

Proof.

We apply Lemma 4.1 with $a=1$, $b=2$ to see that $\Phi$ satisfies the null space property in $\ell _1$ of order $k$ with constant $C= 1+\frac{1+\delta }{\sqrt {2}(1-\delta )} < 2$. This means that for any $\eta \in {\mathcal{N}}$ and $T$ such that $\#T\leq k$, we have

$$\begin{equation} \|\eta \|_{\ell _1} \leq C \|\eta _{T^c}\|_{\ell _1}, \cssId{nulell1}{\tag{4.4}} \end{equation}$$

and therefore

$$\begin{equation} \|\eta _T\|_{\ell _1} \leq (C-1) \|\eta _{T^c}\|_{\ell _1}. \cssId{C0-1}{\tag{4.5}} \end{equation}$$

Let $x^*=\Delta (\Phi x)$ be the solution of (Equation 1.10) so that $\eta =x^*-x \in {\mathcal{N}}$ and

$$\begin{equation} \|x^*\|_{\ell _1} \leq \|x\|_{\ell _1}. \tag{4.6} \end{equation}$$

Denoting by $T$ the set of indices of the largest $k$ coefficients of $x$, we can write

$$\begin{equation} \|x^*_{T}\|_{\ell _1}+\|x^*_{T^c}\|_{\ell _1}\leq \|x_{T}\|_{\ell _1}+\|x_{T^c}\|_{\ell _1}. \tag{4.7} \end{equation}$$

It follows that

$$\begin{equation} \|x_{T}\|_{\ell _1}- \|\eta _{T}\|_{\ell _1}+ \|\eta _{T^c}\|_{\ell _1}-\|x_{T^c}\|_{\ell _1} \leq \|x_{T}\|_{\ell _1}+\|x_{T^c}\|_{\ell _1}, \tag{4.8} \end{equation}$$

and therefore

$$\begin{equation} \|\eta _{T^c}\|_{\ell _1} \le \|\eta _{T}\|_{\ell _1}+2 \|x_{T^c}\|_{\ell _1} = \|\eta _{T}\|_{\ell _1}+2\sigma _k(x)_{\ell _1}. \tag{4.9} \end{equation}$$

Using (Equation 4.5) and the fact that $C<2$, we thus obtain

$$\begin{equation} \|\eta _{T^c}\|_{\ell _1}\leq \frac{2}{2-C}\sigma _k(x)_{\ell _1}. \tag{4.10} \end{equation}$$

We finally use again (Equation 4.4) to conclude that

$$\begin{equation} \|x-x^*\|_{\ell _1} \leq \frac{2C}{2-C}\sigma _k(x)_{\ell _1}, \tag{4.11} \end{equation}$$

which is the announced result.

■
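The constant in Theorem 4.4 is obtained by inserting $C=1+\frac{1+\delta }{\sqrt {2}(1-\delta )}$ into $\frac{2C}{2-C}$ from (Equation 4.11). The following Python sketch, with hypothetical function names, compares this expression with the closed form stated in the theorem.

```python
import math


def constant_from_proof(delta):
    """2C / (2 - C) with C the null space constant of Lemma 4.1 for a = 1, b = 2."""
    C = 1.0 + (1 + delta) / (math.sqrt(2) * (1 - delta))
    if C >= 2:
        raise ValueError("the argument requires C < 2")
    return 2 * C / (2 - C)


def constant_closed_form(delta):
    """Closed form of C_0 as stated in Theorem 4.4."""
    r2 = math.sqrt(2)
    return (2 * r2 + 2 - (2 * r2 - 2) * delta) / (r2 - 1 - (r2 + 1) * delta)
```

The two expressions agree whenever $C<2$, which is equivalent to $\delta <(\sqrt {2}-1)/(\sqrt {2}+1)=(\sqrt {2}-1)^2$; the hypothesis $\delta <(\sqrt {2}-1)^2/3$ of the theorem is therefore sufficient.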

5. The case $X=\ell _2$

In this section, we shall show that instance optimality is not a very viable concept in $X=\ell _2$ in the sense that it will not even hold for $k=1$ unless $n\ge cN$. We know from Corollary 3.3 that if $\Phi$ is a matrix of size $n\times N$ which satisfies

$$\begin{equation} \|x-\Delta (\Phi x)\|_{\ell _2}\le C_0\sigma _k(x)_{\ell _2},\quad x\in \mathbb{R}^N, \cssId{decode1}{\tag{5.1}} \end{equation}$$

for some decoder $\Delta$, then its null space ${\mathcal{N}}$ will need to have the property

$$\begin{equation} \|\eta \|^2_{\ell _2}\le C_0^2\|\eta _{T^c}\|^2_{\ell _2},\quad \# T\le 2k. \cssId{NSP11}{\tag{5.2}} \end{equation}$$

Theorem 5.1.

For any matrix $\Phi$ of dimension $n\times N$, property (Equation 5.2) with $k=1$ implies that $N\leq C_0^2n$.

Proof.

We start from (Equation 5.2) with $k=1$ from which we trivially derive

$$\begin{equation} \|\eta \|^2_{\ell _2}\le C_0^2\|\eta _{T^c}\|^2_{\ell _2},\quad \# T\le 1, \cssId{NSP11p}{\tag{5.3}} \end{equation}$$

or equivalently for all $j\in \{1,\ldots ,N\}$,

$$\begin{equation} \sum _{i=1}^N|\eta _i|^2 \leq C_0^2\sum _{i\neq j} |\eta _i|^2. \tag{5.4} \end{equation}$$

From this, we derive that for all $j\in \{1,\ldots , N\}$,

$$\begin{equation} |\eta _j|^2\leq (C_0^2-1)\sum _{i\neq j} |\eta _i|^2 = (C_0^2-1)(\|\eta \|_{\ell _2}^2-|\eta _j|^2), \tag{5.5} \end{equation}$$

and therefore

$$\begin{equation} |\eta _j|^2\le A\|\eta \|_{\ell _2}^2, \cssId{nsp1}{\tag{5.6}} \end{equation}$$

with $A=1-\frac{1}{C_0^2}$.

Let $(e_j)_{j=1,\ldots ,N}$ be the canonical basis of $\mathbb{R}^N$ so that $\eta _j=\langle \eta ,e_j\rangle$ and let $v_1,\dots , v_{N-n}$ be an orthonormal basis for ${\mathcal{N}}$. Denoting by $P=P_{\mathcal{N}}$ the orthogonal projection onto ${\mathcal{N}}$, we apply (Equation 5.6) to $\eta :=P(e_j)\in {\mathcal{N}}$ and find that for any $j\in \{1,\dots , N\}$

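$$\begin{equation} |\langle P(e_j),e_j\rangle |^2 \le A\|P(e_j)\|_{\ell _2}^2 = A\langle P(e_j),e_j\rangle . \cssId{have1}{\tag{5.7}} \end{equation}$$

Since $\langle P(e_j),e_j\rangle =\|P(e_j)\|_{\ell _2}^2=\sum _{i=1}^{N-n}|\langle e_j,v_i\rangle |^2\geq 0$, this means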

$$\begin{equation} \sum _{i=1}^{N-n}|\langle e_j,v_i\rangle |^2 \le A,\quad j=1,\dots ,N. \cssId{have2}{\tag{5.8}} \end{equation}$$

We sum (Equation 5.8) over $j\in \{1,\dots ,N\}$ and find

$$\begin{equation} N-n= \sum _{i=1}^{N-n}\|v_i\|_{\ell _2}^2\le AN. \cssId{have3}{\tag{5.9}} \end{equation}$$

It follows that $(1-A)N\leq n$. That is, $N\leq nC_0^2$ as desired.

■
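The counting argument in the proof above rests on the identity $\sum _{j=1}^N\langle P(e_j),e_j\rangle =N-n$. The following Python sketch illustrates it for one small example chosen here for convenience, namely the $1\times 4$ matrix $\Phi =(1\ 1\ 1\ 1)$, whose null space consists of the vectors with entries summing to zero. The helper names are hypothetical.

```python
import math


def gram_schmidt(vectors):
    """Orthonormalize a list of vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = [float(t) for t in v]
        for u in basis:
            c = sum(a * b for a, b in zip(w, u))
            w = [a - c * b for a, b in zip(w, u)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > 1e-12:
            basis.append([a / norm for a in w])
    return basis


def projection_diagonal(basis, N):
    """Entries <P e_j, e_j> = sum_i <e_j, v_i>^2 for an orthonormal basis (v_i)."""
    return [sum(v[j] ** 2 for v in basis) for j in range(N)]


# Example: N = 4, n = 1, null space spanned by differences of unit vectors.
N, n = 4, 1
null_basis = gram_schmidt([[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]])
diag = projection_diagonal(null_basis, N)
```

In this example $P=I-\frac{1}{4}\mathbf{1}\mathbf{1}^t$, so every diagonal entry equals $3/4$ and their sum is $N-n=3$. Any admissible $A$ must then satisfy $A\geq 3/4$, that is, $C_0^2\geq 4=N/n$, so the bound of Theorem 5.1 is attained here.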

The above result means that when measuring the error in $\ell _2$, the comparison between compressed sensing and best $k$-term approximation on a general vector of $\mathbb{R}^N$ is strongly in favor of best $k$-term approximation. However, this conclusion should be moderated in two ways. On the one hand, we shall see in §8 that one can obtain mixed-norm estimates of the form (Equation 1.15) from which one finds that compressed sensing compares favorably with best $k$-term approximation over sufficiently concentrated classes of vectors. On the other hand, we shall prove in the next section that (Equation 5.1) can be achieved with $n$ of the same order as $k$ up to a logarithmic factor, if one accepts that this result holds with high probability.

6. The case $X=\ell _2$ in probability

In order to formulate the results of this section, we let $\Omega$ be a probability space with probability measure $P$ and let $\Phi =\Phi (\omega )$, $\omega \in \Omega$, be an $n\times N$ random matrix. We seek results of the following type: for any $x\in \mathbb{R}^N$, if we draw $\Phi$ at random with respect to $P$, then

$$\begin{equation} \|x-\Delta (\Phi x)\|_{\ell _2}\le C_0\sigma _k(x)_{\ell _2} \cssId{decode2}{\tag{6.1}} \end{equation}$$

holds for this particular $x$ with high probability for some decoder $\Delta$ (dependent on the draw $\Phi$). We shall even give explicit decoders which will yield this type of inequality. It should be understood that $\Phi$ is drawn independently for each $x$ in contrast to building a $\Phi$ such that (Equation 6.1) holds simultaneously for all $x\in \mathbb{R}^N$, which was our original definition of instance optimality.

Two simple instances of random matrices which are often considered in compressed sensing are

(1): Gaussian matrices: $\Phi _{i,j}={\mathcal{N}}(0,\frac{1}{n})$ are i.i.d. Gaussian variables of variance $1/n$,
(2): Bernoulli matrices: $\Phi _{i,j}=\frac{\pm 1}{\sqrt n}$ are i.i.d. Bernoulli variables of variance $1/n$.

In order to establish such results, we shall need that the random matrix $\Phi$ has two properties which we now describe. The first of these relates to the restricted isometry property which we know plays a fundamental role in the performance of the matrix $\Phi$ in compressed sensing.

Definition 6.1.

We say that the random matrix $\Phi$ satisfies RIP of order $k$ with constant $\delta$ and probability $1-\epsilon$ if there is a set $\Omega _0\subset \Omega$ with $P(\Omega _0)\geq 1-\epsilon$ such that for all $\omega \in \Omega _0$, the matrix $\Phi (\omega )$ satisfies (Equation 3.1) with constant $\delta _k\leq \delta$.

This property has been shown for random matrices of the above Gaussian or Bernoulli type. Namely, given any $c>0$ and $\delta >0$, there is a constant $c_0>0$ such that for all $n\geq c_0k \log (N/n)$ this property will hold with $\epsilon \le e^{-cn}$; see Reference 2, Reference 5, Reference 8, and Reference 16.

The RIP controls the behavior of $\Phi$ on $\Sigma _k$, or equivalently on all the $k$ dimensional spaces spanned by any subset of $\{e_1,\ldots ,e_N\}$ of cardinality $k$. On the other hand, for a general vector $x\in \mathbb{R}^N$, the image vector $\Phi x$ might have a much larger norm than $x$. However, for standard constructions of random matrices the probability that $\Phi x$ has large norm is small. We formulate this by the following definition.

Definition 6.2.

We say that the random matrix $\Phi$ has the boundedness property with constant $C$ and probability $1-\epsilon$ if for each $x\in \mathbb{R}^N$, there is a set $\Omega _0(x)\subset \Omega$ with $P(\Omega _0(x))\geq 1-\epsilon$ such that for all $\omega \in \Omega _0(x)$,

$$\begin{equation} \|\Phi (\omega )x\|_{\ell _2}\leq C\|x\|_{\ell _2}. \cssId{texmlid2}{\tag{6.2}} \end{equation}$$

Note that the property which is required in this definition is clearly weaker than asking that the spectral norm $\|\Phi \|:=\sup _{\|x\|_{\ell _2}=1}\|\Phi x\|_{\ell _2}$ be not greater than $C$ with probability $1-\epsilon$.

Again, this property has been shown for various random families of matrices and in particular for the Gaussian or Bernoulli families. Namely, given any $C>1$, this property will hold with constant $C$ and $\epsilon \le 2e^{-\beta n}$ with $\beta =\beta (C)>0$; see Reference 1 or the discussion in Reference 2. Thus, the standard constructions of random matrices will satisfy both of these properties.

We now describe our process for decoding $y=\Phi x$, when $\Phi =\Phi (\omega )$ is our given realization of the random matrix. Let $T\subset \{1,\dots ,N\}$ be any subset of column indices with $\#(T)=k$ and let $X_T$ be the linear subspace of $\mathbb{R}^N$ which consists of all vectors supported on $T$. For this $T$, we define

$$\begin{equation} x_T^*:= \operatorname *{Argmin}_{z\in X_T}\|\Phi z-y\|_{\ell _2}. \cssId{Tsol}{\tag{6.3}} \end{equation}$$

In other words, $x_T^*$ is chosen as the least squares minimizer of the residual in approximation by elements of $X_T$. Notice that $x_T^*$ is supported on $T$. If $\Phi$ satisfies RIP of order $k$, then the matrix $\Phi _T^t\Phi _T$ is nonsingular and the nonzero entries of $x^*_T$ are given by

$$\begin{equation} (\Phi _T^t\Phi _T)^{-1}\Phi _T^t y. \cssId{texmlid3}{\tag{6.4}} \end{equation}$$

To decode $y$, we search over all subsets $T$ of cardinality $k$ and choose

$$\begin{equation} T^*:=\operatorname *{Argmin}_{\#(T)=k}\|y-\Phi x^*_T\|_{\ell _2^n}. \cssId{optT}{\tag{6.5}} \end{equation}$$

Our decoding of $y$ is now given by

$$\begin{equation} x^*= \Delta (y):= x^*_{T^*}. \cssId{texmlid1}{\tag{6.6}} \end{equation}$$
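The decoding steps (Equation 6.3) to (Equation 6.6) can be sketched in Python for very small problems. The code below is a minimal illustration under stated assumptions, not a practical decoder: it enumerates all $\binom{N}{k}$ supports, solves the normal equations (Equation 6.4) by Gaussian elimination, and assumes that every $\Phi _T^t\Phi _T$ is nonsingular (as guaranteed by the RIP of order $k$). Here $\Phi _T$ is read as the submatrix of the columns of $\Phi$ indexed by $T$, and all function names are hypothetical.

```python
from itertools import combinations
import math


def solve_linear(A, b):
    """Solve the square system A z = b by Gaussian elimination with pivoting."""
    m = len(A)
    M = [[float(t) for t in A[i]] + [float(b[i])] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: Phi_T^t Phi_T is not invertible")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * m
    for i in reversed(range(m)):
        s = M[i][m] - sum(M[i][c] * z[c] for c in range(i + 1, m))
        z[i] = s / M[i][i]
    return z


def least_squares_on_support(Phi, y, T):
    """Nonzero entries of x_T^*, computed from the normal equations (6.4)."""
    cols = [[row[j] for row in Phi] for j in T]
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve_linear(gram, rhs)


def residual_norm(Phi, y, T, coeffs):
    """l_2 norm of y - Phi x_T^*."""
    r = [y[i] - sum(Phi[i][j] * c for j, c in zip(T, coeffs))
         for i in range(len(Phi))]
    return math.sqrt(sum(t * t for t in r))


def decode(Phi, y, k):
    """Exhaustive decoder (6.5)-(6.6): best least squares fit over all supports."""
    N = len(Phi[0])
    best = None
    for T in combinations(range(N), k):
        coeffs = least_squares_on_support(Phi, y, T)
        res = residual_norm(Phi, y, T, coeffs)
        if best is None or res < best[0]:
            best = (res, T, coeffs)
    x_star = [0.0] * N
    for j, c in zip(best[1], best[2]):
        x_star[j] = c
    return x_star
```

For example, with the rows $(1,0,1,1)$, $(0,1,1,-1)$, $(0,0,1,2)$ and $y=(2,2,2)$, the call `decode(Phi, y, 1)` selects the third column and returns a vector close to $(0,0,2,0)$. The exhaustive loop is what makes this decoder combinatorial in $N$ and $k$.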

The main result of this section is the following.

Theorem 6.3.

Assume that $\Phi$ is a random matrix which satisfies RIP of order $2k$ with constant $\delta$ and probability $1-\epsilon$ and also satisfies the boundedness property with constant $C$ and probability $1-\epsilon$. Then, for each $x\in \mathbb{R}^N$, there exists a set $\Omega (x)\subset \Omega$ with $P(\Omega (x))\geq 1-2\epsilon$ such that for all $\omega \in \Omega (x)$ and $\Phi =\Phi (\omega )$, the estimate (Equation 6.1) holds with $C_0=1+\frac{2C}{1-\delta }$. Here the decoder $\Delta =\Delta (\omega )$ is given by (Equation 6.6).

Proof.

Let $x\in \mathbb{R}^N$ be arbitrary and let $\Phi =\Phi (\omega )$ be the draw of the matrix $\Phi$ from the random ensemble. We denote by $T$ the set of indices corresponding to the $k$ largest coefficients of $x$. Thus

$$\begin{equation} \|x-x_T\|_{\ell _2}=\sigma _k(x)_{\ell _2}. \cssId{kterm1}{\tag{6.7}} \end{equation}$$

We consider the set $\Omega ':=\Omega _0\cap \Omega (x-x_T)$ where $\Omega _0$ is the set in the definition of RIP in probability and $\Omega (x-x_T)$ is the set in the definition of boundedness in probability for the vector $x-x_T$. Then $P(\Omega ')\ge 1-2\epsilon$. For any $\omega \in \Omega '$, we have

$$\begin{equation} \|x-x^*\|_{\ell _2}\le \|x-x_T\|_{\ell _2}+\|x_T-x^*\|_{\ell _2}\le \sigma _k(x)_{\ell _2}+\|x_T-x^*\|_{\ell _2}. \cssId{pest1}{\tag{6.8}} \end{equation}$$

We bound the second term by

$$\begin{array}{ll} \|x_T-x^*\|_{\ell _2}&\le (1-\delta )^{-1}\|\Phi (x_T-x^*)\|_{\ell _2} \\& \le (1-\delta )^{-1}(\|\Phi (x-x_T)\|_{\ell _2}+\|\Phi (x-x^*)\|_{\ell _2}) \\&=(1-\delta )^{-1}(\|y-\Phi x_T\|_{\ell _2}+\|y-\Phi x^*\|_{\ell _2}) \\&\leq 2(1-\delta )^{-1}\|y-\Phi x_T\|_{\ell _2} =2(1-\delta )^{-1}\|\Phi (x-x_T)\|_{\ell _2} \\& \leq 2C(1-\delta )^{-1}\|x-x_T\|_{\ell _2}=2C(1-\delta )^{-1}\sigma _k(x)_{\ell _2}, \end{array}$$

where the first inequality uses the RIP and the fact that $x_T-x^*$ is a vector with support of size at most $2k$, the third inequality uses the minimality of $T^*$ and the fourth inequality uses the boundedness property in probability for $x-x_T$.

■

By virtue of the remarks on the properties of Gaussian and Bernoulli matrices, we derive the following quantitative result.

Corollary 6.4.

If $\Phi$ is a random matrix of either Gaussian or Bernoulli type, then for any $\epsilon >0$ and $C_0>3$, there exists a constant $c_0$ such that if $n\geq c_0 k \log (N/n)$, the following holds: for every $x\in \mathbb{R}^N$, there exists a set $\Omega (x)\subset \Omega$ with $P(\Omega (x))\geq 1-2\epsilon$ such that (Equation 6.1) holds for all $\omega \in \Omega (x)$ and $\Phi =\Phi (\omega )$.

Remark 6.5.

Our analysis yields a constant of the form $C_0=3+\eta$, where $\eta$ can be made arbitrarily small at the expense of raising $n$, and it is not clear to us how to improve this constant down to $1+\eta$ as in Reference 7, Reference 10, Reference 11, and Reference 12.

A variant of the above results deals with the situation where the vector $x$ itself is drawn from a probability measure $Q$ on $\mathbb{R}^N$. In this case, the following result shows that we can first pick the matrix $\Phi$ so that (Equation 6.1) will hold with high probability on the choice of $x$. In other words, only a few pathological signals are not reconstructed up to the accuracy of best $k$-term approximation.

Corollary 6.6.

If $\Phi$ is a random matrix of either Gaussian or Bernoulli type, then for any $\epsilon >0$ and $C_0>3$, there exists a constant $c_0$ such that if $n\geq c_0 k \log (N/n)$, the following holds: there exists a matrix $\Phi$ and a set $\Omega (\Phi )\subset \mathbb{R}^N$ with $Q(\Omega (\Phi ))\geq 1-2\epsilon$ such that (Equation 6.1) holds for all $x\in \Omega (\Phi )$.

Proof.

Consider random matrices of Gaussian or Bernoulli type, and denote by $P$ their probability law. We consider the law $P\otimes Q$ which means that we draw independently $\Phi$ according to $P$ and $x$ according to $Q$ . We denote by $\Omega _{x}$ and $\Omega _\Phi$ the events that (Equation 6.1) does not hold given $x$ and $\Phi$, respectively. The event $\Omega _0$ that (Equation 6.1) does not hold is therefore given by

$$\begin{equation} \Omega _0= \bigcup _{x}\Omega _x=\bigcup _{\Phi }\Omega _\Phi . \tag{6.9} \end{equation}$$

According to Corollary 6.4 (applied with $\epsilon /2$ in place of $\epsilon$, which only affects the constant $c_0$) we know that for all $x\in \mathbb{R}^N$,

$$\begin{equation} P(\Omega _x) \leq \epsilon , \tag{6.10} \end{equation}$$

and therefore

$$\begin{equation} P\otimes Q(\Omega _0) \leq \epsilon . \tag{6.11} \end{equation}$$

By Chebyshev’s inequality, we have for all $t>0$,

$$\begin{equation} P( \{\Phi \; :\;Q(\Omega _\Phi )\geq t\}) \leq \frac{\epsilon }{t}, \tag{6.12} \end{equation}$$

and in particular

$$\begin{equation} P( \{\Phi \; :\; Q(\Omega _\Phi )\geq 2\epsilon \}) \leq \frac{1}{2}. \tag{6.13} \end{equation}$$

This shows that there exists a matrix $\Phi$ such that $Q(\Omega _\Phi )\leq 2\epsilon$, which means that for such a $\Phi$ the estimate (Equation 6.1) holds with probability larger than $1-2\epsilon$ over $x$.

■
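The averaging step behind (Equation 6.11) to (Equation 6.13) can be illustrated on a finite toy model. In the Python sketch below, the table `fail` is entirely hypothetical: rows stand for draws of $\Phi$, columns for signals $x$, both with uniform weights, and an entry $1$ marks a pair for which (Equation 6.1) fails.

```python
def row_failure_rates(fail):
    """Q(Omega_Phi) for each matrix (row), with Q uniform over the columns."""
    return [sum(row) / len(row) for row in fail]


def column_failure_rates(fail):
    """P(Omega_x) for each signal (column), with P uniform over the rows."""
    rows = len(fail)
    return [sum(fail[i][j] for i in range(rows)) / rows
            for j in range(len(fail[0]))]


# Hypothetical failure table: 4 matrices (rows) and 5 signals (columns).
fail = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
eps = max(column_failure_rates(fail))   # each signal fails for at most eps of the rows
rates = row_failure_rates(fail)
# fraction of matrices whose failure rate over signals is at least 2 * eps
bad_fraction = sum(r >= 2 * eps for r in rates) / len(rates)
```

Here every signal fails for one matrix out of four, so `eps` is $1/4$, and no matrix fails on half of the signals or more. As in the proof, a bound on the failure probability for each fixed $x$ forces most matrices to succeed on most signals.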

We close this section with a few remarks comparing the results of this section with other results in the literature. The decoder defined by (Equation 6.3) is not computationally realistic since it requires a combinatorial search over all subsets $T$ of cardinality $k$. A natural question is therefore to obtain a decoder with similar approximation properties and more reasonable computational cost. Let us mention that fast decoding methods have been obtained for certain random constructions of matrices by Cormode and Muthukrishnan Reference 7 and by Gilbert and coworkers Reference 12, Reference 17 that yield approximation properties which are similar to Theorem 6.3. Our results differ from theirs in the following two ways. First, we give general criteria for instance optimality to hold in probability. In this context we have not been concerned about the decoder. Our results can hold in particular for standard random classes of matrices such as the Gaussian and Bernoulli constructions. Secondly, when applying our results to these standard random classes, we obtain the range of $n$ given by $n\geq ck\log (N/n)$ which is slightly wider than the range in these other works. That latter range is also treated in Reference 17 but the corresponding results are confined to $k$-sparse signals. It is shown there that orthogonal matching pursuit (OMP) identifies the support of such a sparse signal with high probability and that the orthogonal projection will then recover it precisely.

7. The case $X=\ell _p$ with $1<p<2$

In this section we shall discuss instance optimality in the case $X=\ell _p$ when $1<p<2$. We therefore discuss the validity of

$$\begin{equation} \|x-\Delta (\Phi x)\|_{\ell _p} \leq C_0 \sigma _k(x)_{\ell _p}, \;\;\; x\in \mathbb{R}^N, \cssId{instp}{\tag{7.1}} \end{equation}$$

depending on the value of $n$. Our first result is a generalization of Lemma 4.1.

Lemma 7.1.

Let $\Phi$ be any matrix which satisfies the RIP of order $2k+\tilde{k}$ with $\delta _{2k+\tilde{k}} \leq \delta <1$ and

$$\begin{equation} \tilde{k}:=k \Bigl (\frac{N}{k}\Bigr )^{2-2/p}. \cssId{tkp}{\tag{7.2}} \end{equation}$$

Then, for any $1\le p<2$, $\Phi$ satisfies the null space property in $\ell _p$ of order $2k$ with constant $C_0= 2^{\frac{1}{p}-\frac{1}{2}}\frac{1+\delta }{1-\delta }$.

Proof.

The proof is very similar to Lemma 4.1, so we sketch it. The idea is to take once again $T_0=T$ to be the set of $2k$ largest coefficients of $\eta$ and to take the other sets $T_j$ of size $\tilde{k}$.

In the same way, we obtain

$$\begin{equation} \|\eta _{T_0}\|_{\ell _2} \le (1+\delta ) (1-\delta )^{-1} \sum _{j=2}^s\|\eta _{T_j}\|_{\ell _2}. \tag{7.3} \end{equation}$$

Now if $j\geq 1$, for any $i\in T_{j+1}$ and $i'\in T_j$, we have $|\eta _i|\leq |\eta _{i'}|$ so that $|\eta _i|^p \leq \tilde{k}^{-1}\|\eta _{T_j}\|^p_{\ell _p}$. It follows that

$$\begin{equation} \|\eta _{T_{j+1}}\|_{\ell _2} \leq (\tilde{k})^{1/2-1/p}\|\eta _{T_j}\|_{\ell _p}, \tag{7.4} \end{equation}$$

so that

$$\begin{equation} \begin{array}{ll} \|\eta _{T}\|_{\ell _p}& \leq (2k)^{1/p-1/2} \|\eta _{T}\|_{\ell _2} \\& \leq (1+\delta ) (1-\delta )^{-1} (2k)^{1/p-1/2}\tilde{k}^{1/2-1/p}\sum _{j=1}^s\|\eta _{T_j}\|_{\ell _p} \\& \leq (1+\delta ) (1-\delta )^{-1}(2k)^{1/p-1/2}\tilde{k}^{1/2-1/p}s^{1-1/p} \|\eta _{T^c}\|_{\ell _p} \\& \leq (1+\delta ) (1-\delta )^{-1}(2k)^{1/p-1/2}\tilde{k}^{1/2-1/p} (N /{\tilde{k}})^{1-1/p}\|\eta _{T^c}\|_{\ell _p}\\&=2^{1/p-1/2}(1+\delta ) (1-\delta )^{-1}\|\eta _{T^c}\|_{\ell _p}, \end{array} \cssId{already}{\tag{7.5}} \end{equation}$$

where we have used Hölder’s inequality twice and the relation between $N$, $k$ and $\tilde{k}$.

■

The corresponding generalization of Theorem 4.2 is now the following.

Theorem 7.2.

Let $\Phi$ be any matrix which satisfies the RIP of order $2k+\tilde{k}$ with $\delta _{2k+\tilde{k}} \leq \delta <1$ and $\tilde{k}$ as in (Equation 7.2). Define the decoder $\Delta$ for $\Phi$ as in (Equation 3.4) for $X=\ell _p$. Then (Equation 7.1) holds with constant $C_0=2^{1/ p+1/ 2} (1+\delta )/(1-\delta )$.

Recall from our earlier remarks that an $n\times N$ matrix $\Phi$ can have RIP of order $\tilde{k}$ provided that $\tilde{k}\leq c_0 n/\log (N/n)$. We therefore conclude from Theorem 7.2 and (Equation 7.2) that instance optimality of order $k$ in the $\ell _p$ norm can be achieved at the price of ${\mathcal{O}}(k(N/k)^{2-2/p}\log (N/n))$ measurements so that the order of ${\mathcal{O}}(k(N/k)^{2-2/p}\log (N/k))$ measurements suffices, which is now significantly higher than $k$ except in the case where $p=1$. In the following, we prove that this price cannot be avoided.
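The role of $\tilde{k}$ from (Equation 7.2) can be made concrete with a short computation. The Python sketch below, with hypothetical function names, evaluates $\tilde{k}$ and checks that the factors in the next-to-last line of (Equation 7.5) collapse to $2^{1/p-1/2}$. Note that $\tilde{k}$ is treated as a real number here, whereas a block size must be an integer.

```python
def k_tilde(N, k, p):
    """Block size (7.2): k (N / k)^(2 - 2/p)."""
    return k * (N / k) ** (2.0 - 2.0 / p)


def chain_constant(N, k, p):
    """(2k)^(1/p - 1/2) * k_tilde^(1/2 - 1/p) * (N / k_tilde)^(1 - 1/p)."""
    kt = k_tilde(N, k, p)
    return (2 * k) ** (1 / p - 0.5) * kt ** (0.5 - 1 / p) * (N / kt) ** (1 - 1 / p)
```

For $N=1000$ and $k=10$, the value of $\tilde{k}$ ranges from $k=10$ at $p=1$ to $N=1000$ at $p=2$, which reflects the growth of the measurement budget as $p$ moves away from $1$.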

Theorem 7.3.

For any $s<2-2/p$ and any matrix $\Phi$ of dimension $n\times N$, property (Equation 7.1) implies that

$$\begin{equation} n \geq c k\Bigl (\frac{N}{k}\Bigr )^{s}, \cssId{necessellp}{\tag{7.6}} \end{equation}$$

with $c=\Bigl (\frac{C_1}{C_0}\Bigr )^{\frac{2/q-1}{1/q-1/p}}$ where $C_0$ is the constant in (Equation 7.1) and $C_1$ the lower constant in (Equation 2.17) and $q$ is defined by the relation $s=2-2/q$.

Proof.

We shall use the results of §2 concerning the Gelfand width and the rate of best $k$-term approximation. If (Equation 1.11) holds, we find that for any compact class $K\subset \mathbb{R}^N$

$$\begin{equation} E_n(K)_{\ell _p} \leq C_0 \sigma _k(K)_{\ell _p}. \cssId{compgelsigmaK}{\tag{7.7}} \end{equation}$$

We now consider the particular classes $K:=U(\ell _q^N)$ with $1\leq q<p$, so that in view of (Equation 2.6) and (Equation 2.17), the inequality (Equation 7.7) becomes

$$\begin{equation} C_1 (N^{1-1/q}n^{-1/2})^{\frac{1/q-1/p}{1/q-1/2}} \leq C_0k^{1/p-1/q}, \tag{7.8} \end{equation}$$

which gives (Equation 7.6) with $s=2-2/q$ and $c=\Bigl (\frac{C_1}{C_0}\Bigr )^{\frac{2/q-1}{1/q-1/p}}$.

■
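The passage from (Equation 7.8) to (Equation 7.6) is a rearrangement of exponents, which can be verified numerically. In the Python sketch below the values of $C_0$ and $C_1$ are arbitrary inputs chosen for illustration, and the function names are hypothetical.

```python
def lower_bound_n(N, k, p, q, C0, C1):
    """Right side of (7.6) with s = 2 - 2/q and c as in Theorem 7.3."""
    if not 1 <= q < p <= 2:
        raise ValueError("expected 1 <= q < p <= 2")
    c = (C1 / C0) ** ((2 / q - 1) / (1 / q - 1 / p))
    return c * k * (N / k) ** (2 - 2 / q)


def sides_of_7_8(N, n, k, p, q, C0, C1):
    """Left and right sides of inequality (7.8)."""
    exponent = (1 / q - 1 / p) / (1 / q - 0.5)
    left = C1 * (N ** (1 - 1 / q) * n ** (-0.5)) ** exponent
    right = C0 * k ** (1 / p - 1 / q)
    return left, right
```

At $n$ equal to the value returned by `lower_bound_n` the two sides of (Equation 7.8) coincide, and since the left side decreases in $n$, the inequality holds exactly for the larger values of $n$.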

Remark 7.4.

In the above proof the constant $c$ blows up as $q$ approaches $p$ and therefore we cannot directly conclude that a condition of the type $n \geq c k(N/ k)^{2-2/p}$ is necessary for (Equation 7.1) to hold, although this seems plausible.

8. Mixed-norm instance optimality

In this section, we extend the study of instance optimality to more general estimates of the type

$$\begin{equation} \|x-\Delta (\Phi x)\|_{X} \leq C_0k^{-s} \sigma _k(x)_Y,\;\;\; x\in \mathbb{R}^N, \cssId{mixedinst}{\tag{8.1}} \end{equation}$$

which we refer to as mixed-norm instance optimality. We have in mind the situation where $X=\ell _p$ and $Y=\ell _q$ with $1\leq q\leq p\leq 2$ and $s=1/q-1/p$. We are thus interested in estimates of the type

$$\begin{equation} \|x-\Delta (\Phi x)\|_{\ell _p} \leq C_0k^{1/p-1/q} \sigma _k(x)_{\ell _q},\;\;\; x\in \mathbb{R}^N. \cssId{mixedinstpq}{\tag{8.2}} \end{equation}$$

The interest in such estimates stems from the following fact. Considering the classes $K=U(\ell ^N_r)$ for $r<q$, we know from (Equation 2.8) that

$$\begin{equation} k^{1/p-1/q} \sigma _k(K)_{\ell _q} \sim k^{1/p-1/q} k^{1/q-1/r} = k^{1/p-1/r} \sim \sigma _k(K)_{\ell _p}. \cssId{sigmaKeq}{\tag{8.3}} \end{equation}$$

Therefore the estimate (Equation 8.2) yields the same rate of approximation as (Equation 7.1) over such classes, and on the other hand we shall see that it is valid for smaller values of $n$.

Our first result is a trivial generalization of Theorem 3.2 and Corollary 3.3 to the case of mixed-norm instance optimality, so we state it without proof. We say that $\Phi$ has the mixed null space property in $(X,Y)$ of order $k$ with constant $C$ and exponent $s$ if

$$\begin{equation} \|\eta \|_X \leq Ck^{-s}\|\eta _{T^c}\|_Y, \cssId{NSPXY}{\tag{8.4}} \end{equation}$$

holds for all $\eta \in {\mathcal{N}}$ and all $T$ with $\#(T)\leq k$.

Theorem 8.1.

Assume given norms $\|\cdot \|_X$ and $\|\cdot \|_Y$, an integer $k>0$ and an encoding matrix $\Phi$. If $\Phi$ has the mixed null space property in $(X,Y)$ of order $2k$ with constant $C_0/2$ and exponent $s$, then there exists a decoder $\Delta$ so that $(\Phi ,\Delta )$ satisfies (Equation 8.1) with constant $C_0$. Conversely, the validity of (Equation 8.1) for some decoder $\Delta$ implies that $\Phi$ has the mixed null space property in $(X,Y)$ of order $2k$ with constant $C_0$ and exponent $s$.

We next give a straightforward generalization of Lemma 7.1.

Lemma 8.2.

Let $\Phi$ be any matrix which satisfies the RIP of order $2k+\tilde{k}$ with $\delta _{2k+\tilde{k}} \leq \delta <1$ and

$$\begin{equation} \tilde{k}:=k \Bigl (\frac{N}{k}\Bigr )^{2-2/q}. \cssId{tkpq}{\tag{8.5}} \end{equation}$$

Then $\Phi$ satisfies the mixed null space property in $(\ell _p,\ell _q)$ of order $2k$ with constant $C_0= 2^{\frac{1}{p}+\frac{1}{2}}\frac{1+\delta }{1-\delta }+2^{\frac{1}{p}-\frac{1}{q}}$ and exponent $s=1/q-1/p$.

Proof.

As in the proof of Lemma 7.1, we take $T_0=T$ to be the set of $2k$ largest coefficients of $\eta$ and we take the other sets $T_j$ of size $\tilde{k}$. By similar arguments, we arrive at the chain of inequalities

$$\begin{eqnarray} \|\eta _{T}\|_{\ell _p}&\leq &(2k)^{1/p-1/2} \|\eta _{T}\|_{\ell _2}\\ & \leq &(1+\delta ) (1-\delta )^{-1} (2k)^{1/p-1/2}\tilde{k}^{1/2-1/q}\sum _{j=1}^s\|\eta _{T_j}\|_{\ell _q} \\ & \leq &(1+\delta ) (1-\delta )^{-1}(2k)^{1/p-1/2}\tilde{k}^{1/2-1/q}s^{1-1/q} \|\eta _{T^c}\|_{\ell _q} \\ & \leq &(1+\delta ) (1-\delta )^{-1}(2k)^{1/p-1/2}\tilde{k}^{1/2-1/q} (N /{\tilde{k}})^{1-1/q}\|\eta _{T^c}\|_{\ell _q} \\ &= &2^{1/p-1/2}(1+\delta ) (1-\delta )^{-1}k^{-s}\|\eta _{T^c}\|_{\ell _q}, \cssId{front}{\tag{8.6}} \end{eqnarray}$$

where we have used Hölder’s inequality both with $\ell _q$ and $\ell _p$ as well as the relation between $N$, $k$ and $\tilde{k}$.

It remains to bound the tail $\|\eta _{T^c}\|_{\ell _p}$. To this end, we infer from (Equation 2.4) that

$$\|\eta _{T^c}\|_{\ell _p} \leq \|\eta \|_{\ell _q}(2k)^{\frac{1}{p}-\frac{1}{q}}\leq \big (\|\eta _T\|_{\ell _q}+\|\eta _{T^c}\|_{\ell _q} \big )(2k)^{\frac{1}{p}-\frac{1}{q}}.$$

Invoking (Equation 7.5) for $p=q$ yields now

$$\|\eta _T\|_{\ell _q}\leq 2^{1/q-1/2}(1+\delta ) (1-\delta )^{-1}\|\eta _{T^c}\|_{\ell _q}$$

so that

$$\begin{equation} \|\eta _{T^c}\|_{\ell _p} \leq \Big (2^{\frac{1}{p} -\frac{1}{2}}(1+\delta )(1-\delta )^{-1} +2^{\frac{1}{p} -\frac{1}{q}} \Big ) \|\eta _{T^c}\|_{\ell _q}k^{\frac{1}{p} -\frac{1}{q}}. \cssId{compl1}{\tag{8.7}} \end{equation}$$

Combining (Equation 8.7) and (Equation 8.6) finishes the proof.

■

We see that considering mixed-norm instance optimality in $(\ell _p,\ell _q)$ in contrast to instance optimality in $\ell _p$ is beneficial since the value of $\tilde{k}$ is smaller in (Equation 8.5) than in (Equation 7.2). The corresponding generalization of Theorem 7.2 is now the following.

Theorem 8.3.

Let $\Phi$ be any matrix which satisfies the RIP of order $2k+\tilde{k}$ with $\delta _{2k+\tilde{k}} \leq \delta <1$ and $\tilde{k}$ as in (Equation 8.5). Define the decoder $\Delta$ for $\Phi$ as in (Equation 3.4) for $X=\ell _p$. Then (Equation 8.2) holds with constant $C_0=2^{\frac{1}{p}+\frac{3}{2}}\frac{1+\delta }{1-\delta }+2^{1+\frac{1}{p}-\frac{1}{q}}$.

By the same reasoning that followed Theorem 7.2 concerning the construction of matrices which satisfy RIP, we conclude that mixed instance optimality of order $k$ in the $\ell _p$ and $\ell _q$ norms can be achieved at the price of ${\mathcal{O}}(k(N/k)^{2-2/q}\log (N/k))$ measurements. In particular, we see that when $q=1$, this type of mixed-norm estimate can be obtained with $n$ larger than $k$ only by a logarithmic factor. Such a result was already observed in Reference 4 in the case $p=2$ and $q=1$. In view of (Equation 8.3) this implies in particular that compressed sensing behaves as well as best $k$-term approximation on classes such as $K=U(\ell _r^N)$ for $r<1$.
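The gain from the mixed-norm setting shows up both in the RIP order $2k+\tilde{k}$ and in the constant of Theorem 8.3. The Python sketch below evaluates both; it is an illustration with hypothetical function names, and $\tilde{k}$ is again treated as a real number.

```python
def rip_order_mixed(N, k, q):
    """RIP order 2k + k_tilde with k_tilde = k (N / k)^(2 - 2/q) as in (8.5)."""
    return 2 * k + k * (N / k) ** (2 - 2 / q)


def theorem_8_3_constant(p, q, delta):
    """Twice the mixed null space constant of Lemma 8.2."""
    if not 0 <= delta < 1:
        raise ValueError("delta must lie in [0, 1)")
    ratio = (1 + delta) / (1 - delta)
    nsp = 2 ** (1 / p + 0.5) * ratio + 2 ** (1 / p - 1 / q)
    return 2 * nsp
```

For $q=1$ the required RIP order is $3k$ for every $N$, and for $p=2$, $q=1$, $\delta =0$ the constant evaluates to $4+\sqrt {2}$.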

One can prove that the above number of measurements is also necessary. This is expressed by a straightforward generalization of Theorem 7.3 that we state without proof.

Theorem 8.4.

For any matrix $\Phi$ of dimension $n\times N$, property (Equation 8.2) implies that

$$\begin{equation} n \geq c k\Bigl (\frac{N}{k}\Bigr )^{2-2/q}, \cssId{necessellp2}{\tag{8.8}} \end{equation}$$

with $c=\Bigl (\frac{C_1}{C_0}\Bigr )^{\frac{2/q-1}{1/q-1/p}}$ where $C_0$ is the constant in (Equation 8.2) and $C_1$ the lower constant in (Equation 2.17).

Remark 8.5.

In general, there is no direct relationship between (Equation 7.1) and (Equation 8.2). We give an example to bring out this fact. Let us consider a fixed value of $1<p\le 2$ and values of $N$ and $k<N/2$. We define $x$ so that its first $k$ coordinates are $1$ and its remaining $N-k$ coordinates are in $(0,1)$. Then $\sigma _k(x)_{\ell _r}=\|z\|_{\ell _r}$ where $z$ is obtained from $x$ by setting the first $k$ coordinates of $x$ equal to zero. We can choose $z$ so that $1/2\le \|z\|_{\ell _r}\le 2$, for $r=p,q$. In this case, the right side in (Equation 8.2) is smaller than the right side of (Equation 7.1) by the factor $k^{1/p-1/q}$ so an estimate in the mixed-norm instance optimality sense is much better for this $x$. On the other hand, if we take all nonzero coordinates of $z$ to be $a$ with $a\in (0,1)$, then the right side of (Equation 7.1) will be smaller than the right side of (Equation 8.2) by the factor $(N/k)^{1/p-1/q}$, which shows that for this $x$ the instance optimality estimate is much better.
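The second example of the remark above can be worked out explicitly. The Python sketch below, with hypothetical function names, takes $x$ with $k$ coordinates equal to $1$ and $N-k$ coordinates equal to $a\in (0,1)$, and evaluates the right sides of (Equation 7.1) and (Equation 8.2) for a common constant $C_0$.

```python
def tail_norm(a, count, r):
    """l_r norm of a vector with `count` entries, all equal to a > 0."""
    return a * count ** (1.0 / r)


def right_sides(N, k, a, p, q, C0=1.0):
    """Right sides of (7.1) and (8.2) for the flat-tail example."""
    rhs_71 = C0 * tail_norm(a, N - k, p)                         # C_0 sigma_k(x)_{l_p}
    rhs_82 = C0 * k ** (1 / p - 1 / q) * tail_norm(a, N - k, q)  # C_0 k^{1/p-1/q} sigma_k(x)_{l_q}
    return rhs_71, rhs_82
```

The quotient of the two right sides equals $((N-k)/k)^{1/p-1/q}$, which is comparable to the factor $(N/k)^{1/p-1/q}$ quoted in the remark since $k<N/2$.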
