# Compressed sensing and best approximation -term

## Abstract

Compressed sensing is a new concept in signal processing where one seeks to minimize the number of measurements to be taken from signals while still retaining the information necessary to approximate them well. The ideas have their origins in certain abstract results from functional analysis and approximation theory by Kashin but were recently brought into the forefront by the work of Candès, Romberg, and Tao and of Donoho who constructed concrete algorithms and showed their promise in application. There remain several fundamental questions on both the theoretical and practical sides of compressed sensing. This paper is primarily concerned with one of these theoretical issues revolving around just how well compressed sensing can approximate a given signal from a given budget of fixed linear measurements, as compared to adaptive linear measurements. More precisely, we consider discrete signals allocate , linear measurements of and we describe the range of , for which these measurements encode enough information to recover in the sense of to the accuracy of best approximation. We also consider the problem of having such accuracy only with high probability. -term

## 1. Introduction

The typical paradigm for obtaining a compressed version of a discrete signal represented by a vector is to choose an appropriate basis, compute the coefficients of in this basis, and then retain only the largest of these with If we are interested in a bit stream representation, we also need in addition to quantize these . coefficients.

Assuming, without loss of generality, that already represents the coefficients of the signal in the appropriate basis, this means that we pick an approximation to in the set of vectors -sparse

where is the support of i.e., the set of , for which and , is the number of elements in the set The best performance that we can achieve by such an approximation process in some given norm . of interest is described by the *best approximation -term* error

This approximation process should be considered as *adaptive* since the indices of those coefficients which are retained vary from one signal to another. On the other hand, this procedure is stressed on the front end by the need to first compute all of the basis coefficients. The view expressed by Candès, Romberg, and Tao Reference5Reference3Reference4 and Donoho Reference8 is that since we retain only a few of these coefficients in the end, perhaps it is possible to actually compute only a few *nonadaptive* linear measurements in the first place and still retain the necessary information about in order to build a compressed representation.

These ideas have given rise to a very lively area of research called *compressed sensing* which poses many intriguing questions, of both a theoretical and practical flavor. The present paper is an excursion into this area, focusing our interest on the question of just how well compressed sensing can perform in comparison to best approximation. -term

To formulate the problem, we are given a budget of questions we can ask about These questions are required to take the form of asking for the values . where the are fixed linear functionals. The information we gather about can therefore by described by

where is an matrix called the *encoder* and is the *information vector*. The rows of are representations of the linear functionals , .

To extract the information that holds about we use a ,*decoder* which is a mapping from We emphasize that . is not required to be linear. Thus, is our approximation to from the information we have retained. We shall denote by the set of all encoder-decoder pairs with an matrix.

There are two common ways to evaluate the performance of an encoding-decoding pair The first is to ask for the largest value of . such that the encoding-decoding is exact for all vectors, i.e., -sparse

It is easy to see (see §2) that given there are , such that (Equation1.4) holds for all Or put in another way, given . we can achieve exact recovery on , whenever Unfortunately such encoder/decoder pairs are not numerically friendly as is explained in §2. .

Generally speaking, our signal will not be in with small but may be approximated well by the elements in Therefore, we would like our algorithms to perform well in this case as well. One way of comparing compressed sensing with best . approximation is to consider their respective performance on a specific class of vectors -term For such a class we can define on the one hand .

and

which describe, respectively, the performance of the two methods over this class. We are now interested in the largest value of such that for a constant independent of the parameters Results of this type were established already in the 1970’s under the umbrella of what is called . The deepest results of this type were given by Kashin -widths.Reference14 with later improvements by Garnaev and Gluskin Reference9Reference13. We recall this well-known story briefly in §2.

The results on referred to above give matching upper and lower estimates for -widths in the case that is a typical sparsity class such as a ball in where

This in turn determines the largest range of for which we can obtain comparisons of the form One such result is the following: for . , one has ,

whenever

with absolute constants .

The decoders used in proving these theoretical bounds are far from being practical or numerically implementable. One of the remarkable achievements of the recent work of Candès, Romberg and Tao Reference3 and Donoho Reference8 is to give probabilistic constructions of matrices which provide these bounds where the decoding can be done by solving the minimization problem

The above results on approximation of classes is governed by the worst elements in the class. It is a more subtle problem to obtain estimates that depend on the individual characteristics of the target vector The main contribution of the present paper is to study a stronger way to compare the performance of . approximation in a compressed sensing algorithm. Namely, we address the following question: -term

*For a given norm and what is the minimal value of , for which there exists a pair such that*

*for all with , a constant independent of and ?*

If a result of the form (Equation1.11) has been established, then one can derive a result for a class by simply taking the supremum over all However, results on classes are less precise and informative than ( .Equation1.11).

We shall say a pair satisfying (Equation1.11) is *instance optimal* of order with constant for the space In particular, we want to understand under what circumstances the minimal value of . is roughly of the same order as similar to ( ,Equation1.9). We shall see that the answer to this question strongly depends on the norm under consideration.

The approximation accuracy of a compressed sensing matrix is determined by the null space

The importance of is that if we observe without any a priori information on the set of , such that is given by the affine space

We bring out the importance of the null space in §3 where we formulate a property of the null space which is necessary and sufficient for to have a decoder for which the instance optimality (Equation1.11) holds.

We apply this property in §4 to the case In this case, we show the minimal number of measurements . which ensures (Equation1.11) is of the same order as up to a logarithmic factor. In that sense, compressed sensing performs almost as well as best approximation. We also show that, similar to the work of Candès, Romberg, and Tao this is achieved with the decoder -term defined by minimization. We should mention that our results in this section are essentially contained in the work of Candès, Romberg, and Tao Reference5Reference6Reference4 and we build on their ideas.

We next treat the case in §5. In this case, the situation is much less in favor of compressed sensing, since the minimal number of measurements which ensures (Equation1.11) is now of the same order as .

In §6, we consider an important variant of the case where we ask for instance optimality in the sense of probability. Here, rather than requiring that (Equation1.11) holds for all we ask only that for each given , it holds with high probability. We shall see that in the case the minimal number of measurements for such results is dramatically reduced, down to the order given by condition (Equation1.9). Moreover, we show that standard constructions of random matrices such as Gaussian and Bernoulli ensembles achieve this performance.

The striking contrast between the results of §5 and §6 shows that the probabilistic setting plays a crucial role in instance optimality. Similar results in the sense of probability have been obtained earlier in a series of paper Reference7Reference10Reference11Reference12 that reflect the theoretical computer science approach to compressed sensing, also known as data sketching. A comparison with our results is in order.

First, the instance optimality bounds obtained in these papers are quantitatively more precise than ours, since they have the general form

where can be made arbitrarily small, at the expense of raising while in most of our results the constant , in (Equation1.11) cannot get arbitrarily close to On the other hand, for a fixed . the ratio between , and is generally not as good as in (Equation1.9): for instance the decoders proposed in Reference7 and Reference11, respectively, use and samples in order to achieve (Equation1.14).

Secondly, the types of encoding matrices which are proposed in these papers are of fairly different nature than those which are considered in §6, and our analysis actually *does not apply* to these matrices. Let us mention that one specific interest of the Gaussian matrices which are considered in the present paper is that they give rise to an encoding which is “robust” with respect to a change of the basis in which the signal is sparse, since the product of such a and any unitary matrix results in a matrix with the same probability law.

Finally, one of the significant achievements in Reference7Reference10Reference11Reference12 is the derivation of practical decoding algorithms of polynomial complexity in up to logarithmic factors, therefore typically faster than solving the minimization problem, while we do not propose any such algorithm in the present paper.

Generally speaking, an important issue in compressed sensing is the practical implementation of the decoder by a fast algorithm. While being aware of this fact, the main goal of the present paper is to understand the theoretical limits of compressed sensing in comparison to nonlinear approximation. Therefore the main question that we address is, *“How many measurements do we need so that some decoder recovers up to some prescribed tolerance?”*, rather than, *“What is the fastest algorithm which allows to recover from these measurements up to the same tolerance?”*

The last sections of the paper are devoted to additional results which complete the theory. In order to limit the size of the paper, we only give a sketch of the proofs in those sections. The case for is treated in §7, and in §8 we discuss another type of estimate that we refer to as *mixed-norm instance optimality*. Here the estimate (Equation1.11) is replaced by an estimate of the type

where differs from and is some relevant exponent. This type of estimate was introduced in Reference4 in the particular case and We give examples in the case . and in which mixed-norm estimates allow us to recover better approximation estimates for compressed sensing than (Equation1.11).

## 2. Performance over classes

We begin by recalling some well-known results concerning best approximation which we shall use in the course of this paper. Given a sequence norm -term on and a positive integer we define the approximation class , by means of

Notice that since we are in a finite dimensional space this (quasi-)norm will be finite for all , .

A simple, yet fundamental, chapter in approximation is to connect the approximation norm in ( -termEquation2.1) with traditional sequence norms. For this, we define for any the weak , norm as

Again, for any all of these norms are finite.

If we fix the norm in which approximation error is to be measured, then for any we have for , ,

for two absolute constants Notice that the constants in these inequalities do not depend on . Therefore, . is equivalent to with equivalent norms.

Since the norm is larger than the weak norm, we can replace the weak norm by the norm in the right inequality of (Equation2.3). However, the constant can be improved via a direct argument. Namely, if then for any , ,

To prove this, take as the set of indices corresponding to the largest entries in If . is the size of the smallest entry in then , and therefore

so that (Equation2.4) follows.

From this, we see that if we consider the class we have ,

with On the other hand, taking . such that for indices and otherwise, we find that

so that can be framed by

We next turn to the performance of compressed sensing over *classes of vectors*, by studying the quantity defined by (Equation1.6). As we have mentioned, the optimal performance of sensing algorithms is closely connected to the concept of Gelfand widths which are in some sense dual to the perhaps better known Kolmogorov widths. If is a compact set in and is a positive integer, then the Gelfand width of and of order is by definition

where the infimum is taken over all subspaces of of codimension less than or equal to This quantity is equivalent to . according to the following well-known result. ,

### Lemma 2.1

Let be any set for which and for which there is a such that If . is any normed space, then

### Proof.

We give a proof for completeness of this paper. We first remark that the null space of is of codimension less than or equal to Conversely, given any space . of codimension we can associate its orthogonal complement , which is of dimension and the matrix whose rows are formed by any basis for Through this identification, we see that .

where the infimum is taken over all matrices .

Now, if ( is any encoder-decoder pair and ) then for any , we also have , It follows that either . or Since . we conclude that ,

Taking an infimum over all encoder-decoder pairs in we obtain the left inequality in ( ,Equation2.10).

To prove the right inequality, we choose an optimal for and use the matrix associated to (i.e., the rows of are a basis for We define a decoder ). for as follows. Given in the range of we recall that , is the set of such that If . we take any , and define When . we define , as any element from This gives .

where we have used the fact that and by our assumptions on This proves the right inequality in ( .Equation2.10).

The orders of the Gelfand widths of balls in are known except perhaps for the case For the range of . that is relevant here even the constants are known. We recall the following results of Gluskin, Garnaev and Kashin which can be found in Reference13Reference9Reference14; see also Reference15. For we have ,

where only depend on and and where

and

Since obviously satisfies the assumptions of Lemma 2.1 with we also have ,

From (Equation2.14), (Equation2.16), (Equation2.10) we deduce indeed the announced fact that can only hold when and the necessary number of measurements are interrelated by (Equation1.9). The possible range of for which even instance optimality could hold is therefore also limited by (Equation1.9), a relation that will turn up frequently in what follows.

## 3. Instance optimality and the null space of

We now turn to the main question addressed in this paper, namely the study of instance optimality as expressed by (Equation1.11). In this section, we shall see that (Equation1.11) can be reformulated as a property of the null space of As was already remarked in the proof of Lemma .2.1, this null space has codimension not larger than .

We shall also need to consider sections of obtained by keeping some of its columns: for we denote by , the matrix formed from the columns of with indices in Similarly we shall have to deal with restrictions . of vectors to sets However, it will be convenient to view such restrictions still as elements of . i.e., , agrees with on and has all components equal to zero whose indices do not belong to .

We begin by studying under what circumstances the measurement vector uniquely determines each vector -sparse This is expressed by the following trivial lemma. .

### Lemma 3.1

If is any matrix and then the following are equivalent: ,

(i) There is a decoder such that for all , .

(ii) .

(iii) For any set with the matrix , has rank .

(iv) The symmetric nonnegative matrix is invertible, i.e., positive definite.

### Proof.

The equivalence of (ii), (iii), (iv) is linear algebra.

(i) Suppose (i) holds and (ii): We can write . where both Since . we have, by (i), that , and hence .

(ii) Given any (i): we define , to be any element in with smallest support. Now, if with then , From (ii), this means that . Hence, if . then , as desired.

The properties discussed in Lemma 3.1 are algebraic properties of If . are fixed, the question arises as to how large we need to make so that there is a matrix having the properties of the lemma. It is easy to see that we can take Indeed, for any . and we can find a set , of vectors in such that any of them are linearly independent. For example if then the matrix whose , entry is has the properties of Lemma 3.1. Its minors are Vandermonde matrices which are well known to be nonsingular. Unfortunately, such matrices are poorly conditioned when is large and the process of recovering from is therefore numerically unstable.

Stable recovery procedures have been proposed by Candès, Romberg, and Tao and by Donoho under stronger conditions on We shall make heavy use in this paper of the following property introduced by Candès and Tao. We say that . satisfies the *restricted isometry property* (RIP) of order if there is a such that

holds for all of cardinality .Footnote^{1} The RIP condition could be replaced by the assumption that holds for all with absolute constants , in all that follows. However, this latter condition is equivalent to having a rescaled matrix satisfy RIP for some and the rescaled matrix extracts exactly the same information from a vector as does.^{✖} The RIP condition is equivalent to saying that the symmetric matrix is positive definite with eigenvalues in Note that RIP of order . always implies RIP of order Note also that RIP of order . guarantees that the properties of Lemma 3.1 hold.

Candès and Tao have shown that any matrix which satisfies the RIP property for and sufficiently small will extract enough information about to approximate it well and moreover the decoding can be done by minimization. The key question then is, given a fixed how large can we take , and still have matrices which satisfy RIP for It was shown by Candès and Tao ?Reference5, as well as Donoho Reference8, that certain families of random matrices will, with high probability, satisfy RIP of order with for some prescribed *independent of * provided Here . is a constant which when made small will make small as well. It should be stressed that all available constructions of such matrices (so far) involve random variables. For instance, as we shall recall in more detail in §6, the entries of can be picked as i.i.d. Gaussian or Bernoulli variables with proper normalization.

We turn to the question of whether contains enough information to approximate to accuracy as expressed by (Equation1.11). The following theorem shows that this can be understood through the study of the null space of .

### Theorem 3.2

Given an matrix a norm , and a value of then a sufficient condition that there exists a decoder , such that (Equation1.11) holds with constant is that

A necessary condition is that

### Proof.

To prove the sufficiency of (Equation3.2), we will define a decoder for as follows. Given any we consider the set , and choose

We shall prove that for all

Indeed, is in and hence by (Equation3.2), we have

where the second inequality uses the fact that and the last inequality uses the fact that minimizes over .

To prove the necessity of (Equation3.3), let be any decoder for which (Equation1.11) holds. Let be any element in and let be the best approximation of -term in Letting . be any splitting of into two vectors of support size we can write ,

with Since . we have by ( ,Equation1.11) that but since , we also have , so that Using again ( .Equation1.11), we derive

which is (Equation3.3).

When is an space, the best approximation is obtained by leaving the -term largest components of unchanged and setting all the others to Therefore the property .

can be reformulated by saying that

holds for all such that where , is the complement set of in In going further, we shall say that . has the *null space property* in of order with constant if (Equation3.8) holds for all and Thus, we have .

### Corollary 3.3

Suppose that is an space, an integer and an encoding matrix. If has the null space property (Equation3.8) in of order with constant then there exists a decoder , so that satisfies (Equation1.11) with constant Conversely, the validity of .(Equation1.11) for some decoder implies that has the null space property (Equation3.8) in of order with constant .

In the next two sections, we shall use this corollary in order to study instance optimality in the case where the norm is and respectively. ,

## 4. The case

In this section, we shall study the null space property (Equation3.8) in the case where We shall make use of the restricted isometry property ( .Equation3.1) introduced by Candès and Tao. We begin with the following lemma whose proof is inspired by results in Reference4.

### Lemma 4.1

Let , with integers. If is any matrix which satisfies the RIP of order with Then . satisfies the null space property in of order with constant .

### Proof.

It is enough to prove (Equation3.8) in the case when is the set of indices of the largest entries of Let . , denote the set of indices of the next largest entries of , the next largest, and so on. The last set defined this way may have less than elements.

We define Since . we have , so that ,

where we have used both bounds in (Equation3.1). Now for any and we have , so that It follows that .

so that

By the Cauchy-Schwartz inequality and we therefore obtain ,

which verifies the null space property with the constant .

Combining Corollary 3.3 and Lemma 4.1 (with and we have therefore proved the following. ),

### Theorem 4.2

Let be any matrix which satisfies the RIP of order Define the decoder . for as in (Equation3.4) for Then .(Equation1.11) holds in with constant Generally speaking, we cannot derive a constant of the type . from an analysis based on Lemma 4.1, since it requires that the null space property holds with constant which is therefore larger than .

As was mentioned in the previous section, one can build matrices which satisfy the RIP of order under the condition where is some fixed constant. We therefore conclude that instance optimality of order in the norm can be achieved at the price of measurements.

### Remark 4.3

More generally, if satisfies the RIP of order and is defined by (Equation3.4) for then ,(Equation1.11) holds in with constant Therefore, if we make . large, the constant in (Equation1.11) is of the type under a condition of the type .

Note that on the other hand, since instance optimality of order in any norm always implies that the reconstruction is exact when it cannot be achieved with less than , measurements according to Lemma 3.1.

Before addressing the case, let us briefly discuss the decoder which achieves (Equation1.11) for such a According to the proof of Theorem .3.2, one can build as the solution of the minimization problem (Equation3.4). It is not clear to us whether this minimization problem can be solved in polynomial time in The following result shows that it is possible to define . by minimization if satisfies the RIP with some additional control on the constants in (Equation3.1).

### Theorem 4.4

Let be any matrix which satisfies the RIP of order with Define the decoder . for as in (Equation1.10). Then, satisfies (Equation1.11) in with .

### Proof.

We apply Lemma 4.1 with , to see that satisfies the null space property in of order with constant This means that for any . and such that we have ,

and therefore

Let be the solution of (Equation1.10) so that and

Denoting by the set of indices of the largest coefficients of we can write ,

It follows that

and therefore

Using (Equation4.5) and the fact that we thus obtain ,

We finally use again (Equation4.4) to conclude that

which is the announced result.

## 5. The case

In this section, we shall show that instance optimality is not a very viable concept in in the sense that it will not even hold for unless We know from Corollary .3.3 that if is a matrix of size which satisfies

for some decoder then its null space , will need to have the property

### Theorem 5.1

For any matrix of dimension property ,(Equation5.2) with implies that .

### Proof.

We start from (Equation5.2) with from which we trivially derive

or equivalently for all ,

From this, we derive that for all ,

and therefore

with .

Let be the canonical basis of so that and let be an orthonormal basis for Denoting by . the orthognal projection onto we apply ( ,Equation5.6) to and find that for any

This means

We sum (Equation5.8) over and find

It follows that That is, . as desired.

The above result means that when measuring the error in the comparison between compressed sensing and best , approximation on a general vector of -term is strongly in favor of best approximation. However, this conclusion should be moderated in two ways. On the one hand, we shall see in §8 that one can obtain mixed-norm estimates of the form ( -termEquation1.15) from which one finds that compressed sensing compares favorably with best approximation over sufficiently concentrated classes of vectors. On the other hand, we shall prove in the next section that ( -termEquation5.1) can be achieved with of the same order as up to a logarithmic factor, if one accepts that this result holds with high probability.

## 6. The case in probability

In order to formulate the results of this section, we let be a probability space with probability measure and let , be an , random matrix. We seek results of the following type: for any if we draw , at random with respect to then ,

holds for this particular with high probability for some decoder (dependent on the draw We shall even give explicit decoders which will yield this type of inequality. It should be understood that ). is drawn independently for each in contrast to building a such that (Equation6.1) holds simultaneously for all which was our original definition of instance optimality. ,

Two simple instances of random matrices which are often considered in compressed sensing are

- (1)
Gaussian matrices: are i.i.d. Gaussian variables of variance ,

- (2)
Bernoulli matrices: are i.i.d. Bernoulli variables of variance .

In order to establish such results, we shall need that the random matrix has two properties which we now describe. The first of these relates to the restricted isometry property which we know plays a fundamental role in the performance of the matrix in compressed sensing.

### Definition 6.1

We say that the random matrix satisfies RIP of order with constant and probability if there is a set with such that for all the matrix , satisfies (Equation3.1) with constant .

This property has been shown for random matrices of the above Gaussian or Bernoulli type. Namely, given any and there is a constant , such that for all this property will hold with see ;Reference2Reference5Reference8Reference16.

The RIP controls the behavior of on or equivalently on all the , dimensional spaces spanned by any subset of of cardinality On the other hand, for a general vector . the image vector , might have a much larger norm than However, for standard constructions of random matrices the probability that . has large norm is small. We formulate this by the following definition.

### Definition 6.2

We say that the random matrix has the boundedness property with constant and probability if for each there is a set , with such that for all ,

Note that the property which is required in this definition is clearly weaker than asking that the spectral norm be not greater than with probability .

Again, this property has been shown for various random families of matrices and in particular for the Gaussian or Bernoulli families. Namely, given any this property will hold with constant , and with see ;Reference1 or the discussion in Reference2. Thus, the standard constructions of random matrices will satisfy both of these properties.

We now describe our process for decoding when , is our given realization of the random matrix. Let be any subset of column indices with and let be the linear subspace of which consists of all vectors supported on For this . we define ,

In other words, is chosen as the least squares minimizer of the residual in approximation by elements of Notice that . is supported on If . satisfies RIP of order then the matrix , is nonsingular and the nonzero entries of are given by

To decode we search over all subsets , of cardinality and choose

Our decoding of is now given by

The main result of this section is the following.

### Theorem 6.3

Assume that is a random matrix which satisfies RIP of order with constant and probability and also satisfies the boundedness property with constant and probability Then, for each . there exists a set , with such that for all and the estimate ,(Equation6.1) holds with Here the decoder . is given by (Equation6.6).

### Proof.

Let be arbitrary and let be the draw of the matrix from the random ensemble. We denote by the set of indices corresponding to the largest coefficients of Thus .

We consider the set where is the set in the definition of RIP in probability and is the set in the definition of boundedness in probability for the vector Then . For any . we have ,

We bound the second term by

where the first inequality uses the RIP and the fact that is a vector with support of size less than the third inequality uses the minimality of , and the fourth inequality uses the boundedness property in probability for .

By virtue of the remarks on the properties of Gaussian and Bernoulli matrices, we derive the following quantitative result.

### Corollary 6.4

If is a random matrix of either Gaussian or Bernoulli type, then for any and there exists a constant , such that if the following holds: for every , there exists a set , with such that (Equation6.1) holds for all and .

### Remark 6.5

Our analysis yields a constant of the form where , can be made arbitraritly small at the expense of raising and it is not clear to us how to improve this constant down to , as in Reference7Reference10Reference11Reference12.

A variant of the above results deals with the situation where the vector itself is drawn from a probability measure on In this case, the following result shows that we can first pick the matrix . so that (Equation6.1) will hold with high probability on the choice of In other words, only a few pathological signals are not reconstructed up to the accuracy of best . approximation. -term

### Corollary 6.6

If a random matrix of either Gaussian or Bernoulli type, then for any and there exists a constant , such that if the following holds: there exists a matrix , and a set with