I. Introduction
The theory of Compressed Sensing (CS) establishes that the combinatorial problem of recovering the sparsest vector from a limited number of linear measurements can be solved in polynomial time, provided that the measurements satisfy certain isometry conditions
[20]. CS can be directly applied to recover signals that are naturally sparse in the standard basis. CS has also been extended to many other types of natural signals that can be represented by a sparse vector using a dictionary [1]. As an alternative to model-based dictionaries such as wavelets [2], Dictionary Learning (DL) [3] is a data-driven algorithmic approach to building sparse representations for natural signals. Learning dictionaries over large-scale databases of training images is a time- and memory-intensive process that results in a universal dictionary suitable for most types of natural images. Meanwhile, several variations of DL algorithms have been proposed for real-time applications to make the sparse representations more adaptive^1 in applications such as image denoising [4]

^1 We must emphasize the difference between adapting and learning dictionaries, although the two terms are sometimes used interchangeably. In this paper, by dictionary learning we refer to its typical usage, i.e. the process of applying the DL algorithm to a large database of training images to produce a universal dictionary. Adapting a dictionary is the process of applying the DL algorithm to only one or a small number of possibly corrupted images to produce a dictionary that performs well specifically for those images.
[5] and, most recently, compressed sensing [6]. The last application in particular has been termed Blind Compressed Sensing (BCS) to differentiate it from standard CS, where the dictionary is assumed to be known and fixed. Clearly, one would expect BCS to improve CS recovery when the optimal sparsity basis is unknown, which is the case in most real-world applications. Unfortunately, the existing work on BCS for imaging is lacking in two directions: (i) empirical evaluations and (ii) mathematical justification for the general case. These issues are discussed further below.
Empirical evaluations: In existing BCS works, such as [7], empirical evaluations on images are mainly limited to the image inpainting problem, which can be viewed as a CS problem where the compressive measurements are in the standard basis. In [8], the generic CS problem is only tested on artificially generated sparse vectors. The tested images and running scenarios for the algorithms in [7, 8, 9, 10] are rather limited and arguably inadequate for indicating the strengths and weaknesses of BCS in real-world imaging applications. Finally, and most importantly, existing studies fail to compare adaptive BCS recovery with non-adaptive CS recovery based on universally learned dictionaries.

Mathematical justification: The original BCS effort [6] identifies the general unconstrained BCS problem as ill-posed. Subsequently, various structural constraints were proposed for the learned dictionary and were shown to ensure uniqueness at the cost of decreased flexibility. In a subsequent effort [7], a different strategy was used to ensure uniqueness without enforcing structural constraints on the dictionary. However, uniqueness was only justified for the class of sparse signals that admit the block-sparse model, by exploiting recent results from the area of low-rank matrix completion [11]. Finally, [8] and [9] take empirical approaches toward unconstrained DL based on compressive measurements but do not provide any justification for the uniqueness, convergence or accuracy of the proposed DL algorithms.
The present work is different from existing efforts in BCS, both in terms of goals and methodology. In addition to the goal of having an objective function with a unique optimum, we would like the learned dictionary to be as close as possible to the ideal dictionary that would result from running the DL algorithm over the complete image. In other words, our goals include both convergence to a unique solution and high accuracy of that solution. Since no prior information is available about the structure of the ideal dictionary or the underlying signal, our method does not impose extra structural constraints on the learned dictionary^2 or the sparse coefficients.

^2 The constraint of having a bounded column norm or Frobenius norm of the dictionary, which is used in virtually every dictionary learning algorithm, does not constrain the dictionary structure other than bounding it or its columns inside the unit sphere. Some examples of structural constraints, used in [6] and [10] respectively, are block-diagonal dictionaries and sparse dictionaries.
Similar to most efforts in the area of compressive imaging, including the BCS framework, we employ a block compressed sensing or block-CS scheme for measurement and recovery of images [12, 13]. Unlike dense-CS, where the image is recovered as a single vector using a dense sampling matrix, block-CS breaks down the high-dimensional dense-CS problem into many small CS problems, one for each non-overlapping block of the image. Some advantages of block-CS are: (i) block-CS results in a block-diagonal sampling matrix, which significantly reduces the amount of memory required for storing large-scale sampling matrices; (ii) decoding in extremely high dimensions^3 is computationally challenging in dense-CS; and (iii) sparse modeling, or learning sparsifying dictionaries, for high-dimensional global image characteristics is challenging and not well studied. Specifically, we study a block-CS scheme where each block is sampled using a distinct sampling matrix and show that it is superior to using a fixed block sampling matrix for BCS. One of our goals in this paper is to outperform non-adaptive sparse image recovery using universal dictionaries based on well-known DL algorithms, such as online-DL [17] and K-SVD [4], while overcoming challenges such as overfitting. Rather than focusing on new DL algorithms for BCS, we focus on the relationship between the block-CS measurements and the BCS performance in an unconstrained setup.

^3 A typical consumer image contains millions of pixels, which makes it impractical to recover as a single vector using existing sparse recovery methods with cubic or quadratic time complexities.
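To make advantage (i) concrete, the following back-of-the-envelope sketch compares the storage footprint of a dense sampling matrix against per-block matrices. The sizes are hypothetical (a 512x512 image split into 8x8 blocks); the savings factor equals the number of blocks:

```python
# Hypothetical sizes: a 512x512 image, 8x8 blocks (n = 64 pixels per block),
# m = 32 measurements per block, N = 4096 blocks in total.
N, n, m = 4096, 64, 32
dense_entries = (N * m) * (N * n)  # one dense sampling matrix for the whole image
block_entries = N * (m * n)        # N separate m-by-n block matrices
ratio = dense_entries // block_entries
print(ratio)  # -> 4096, i.e. the savings factor equals N
```

In other words, the block-diagonal structure stores N times fewer entries than a dense matrix of the same overall dimensions.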
This paper is organized as follows. In Section II we review the dictionary learning problem under the settings of complete and compressive data. Before describing the details of our algorithm in Section IV, we present our main contributions regarding the uniqueness conditions and the DL accuracy in the presence of partial data in Section III. Simulation results are presented in Section V. Finally, we present the conclusion and a discussion of future directions in Section VI.
I-A. Notation
Throughout the paper, we use the following rules. Uppercase letters are used for matrices and lowercase letters are used for vectors and scalars.
denotes the identity matrix of size
. We reserve the following notation: is the total number of blocks in an image, is the size of each block (e.g. an block has size ), is the number of compressive measurements per block (usually ), is the number of atoms in a dictionary, is the iteration count, denotes a dictionary, represents the vectorized image block (column-major) number , is the representation of (i.e. ), denotes the measurement matrix for block number , and denotes the vector of compressive measurements (i.e. ). For simplicity, we drop the block index subscripts in , , and when a single block is under consideration. Similarly, we omit the iteration superscript in and when a single iteration is under study and no confusion arises. The vector norm is defined as . The matrix operators and respectively represent the Kronecker and the Hadamard (element-wise) products. The operator reshapes a matrix to its column-major vectorized format. The matrix inner product is defined as , with denoting the matrix trace, i.e. the sum of the diagonal entries of . The Frobenius norm of is defined as .
Finally, due to the frequent usage of Lasso regression
[14] in this paper, we use the following abstractions. In words, represents the model misfit and denotes the sparse coefficient vector. Also, note the obvious relationship:
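Since the Lasso abstraction recurs throughout the paper, a minimal numerical sketch may help. The solver below uses ISTA (proximal gradient descent) for the objective 0.5*||y - D x||_2^2 + lam*||x||_1; this is one standard Lasso solver, not necessarily the one used by the authors, and the scaling of the penalty may differ from the paper's convention:

```python
import numpy as np

def ista_lasso(D, y, lam, n_iter=500):
    """Minimize 0.5*||y - D x||_2^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ x - y)              # gradient of the quadratic part
        z = x - g / L                      # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x
```

For an orthonormal dictionary the solution reduces to soft-thresholding of D.T @ y, which gives a quick sanity check.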
II. Problem Statement and Prior Art
II-A. An overview of the dictionary learning problem
The Dictionary Learning (DL) problem can be compactly expressed as:
(P1) 
where represents the collective model misfit^4:

^4 There are alternative, related ways of expressing the DL problem. For example, the authors in [15] propose to minimize the sum of norms of the coefficient vectors for a fixed (zero) representation error for each sample .
As noted in [3] for a similar formulation of DL, (P1) represents a bilevel optimization problem:

The inner layer (also known as the lower level) problem consists of solving Lasso problems to get .

The outer layer (or upper level) problem consists of finding a that minimizes .
Note that even for large-scale images () the lower-level optimization can be handled efficiently by parallel programming, because each block is processed independently. However, in a batch-DL algorithm^5, in contrast to online-DL [17], the upper-level problem is centralized and combines the information collected from all blocks. In this paper, we use the batch-DL approach to stay consistent with the mathematical analysis. However, in Section IV, an efficient algorithm is described for solving the batch problem. Similar to [17], the batch algorithm and its analysis can be extended to online-DL for the best efficiency. Hereafter, we omit the prefix 'batch' in batch-DL for simplicity.

^5 Batch processing refers to the processing of all blocks at once, while online processing refers to the one-by-one processing of blocks in a streaming fashion.
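Because the lower-level problems are mutually independent, the sparse coding stage parallelizes trivially across blocks. A standard-library sketch of that structure (the per-block solve is a placeholder, not the paper's code):

```python
from concurrent.futures import ThreadPoolExecutor

def solve_block(block_id):
    """Placeholder for the per-block Lasso solve; each call is independent,
    so blocks can be dispatched to separate workers."""
    return block_id * block_id            # stands in for a coefficient vector

# Dispatch all blocks to a worker pool; results come back in block order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(solve_block, range(8)))
```

Real workloads would use a process pool (the per-block solves are CPU-bound), but the map-over-independent-blocks structure is the same.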
The typical DL strategy that is used in most works [4, 5, 17, 18, 22, 23, 21] is to iterate between the inner and outer optimization problems until convergence to a local minimum. Expressed formally, the iterative procedure is:
(1) 
with
The algorithm starts from an initial dictionary that can be selected to be, for example, the overcomplete discrete cosine frame [4].
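As a concrete example of such an initialization, here is one common construction of an overcomplete DCT frame with unit-norm atoms. This is a sketch in the style of the K-SVD reference implementation; the exact construction in [4] may differ in details:

```python
import numpy as np

def overcomplete_dct(n, K):
    """Build an n x K overcomplete DCT frame with unit-norm columns.
    Column k samples cos(pi * k * t / K) on t = 0, ..., n-1."""
    D = np.cos(np.pi / K * np.outer(np.arange(n), np.arange(K)))
    D[:, 1:] -= D[:, 1:].mean(axis=0)   # remove the DC part of non-constant atoms
    return D / np.linalg.norm(D, axis=0)

D0 = overcomplete_dct(64, 256)           # e.g. vectorized 8x8 blocks, 4x overcomplete
```

For 2D image blocks, the Kronecker product of two such 1D frames is typically used.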
Perhaps surprisingly, the solution of (P1) is trivial without the additional constraint of having a bounded dictionary norm. To explain, one can always multiply the dictionary by a scalar and the coefficients by its inverse, keeping the product fixed while shrinking the norm of the coefficient vector and thus the objective, leading to and . There are two typical bounding methods to address this issue, reviewed in e.g. [24, 25]: (i) bounding the norm of each dictionary column, or (ii) bounding . In this work, we use the second approach, i.e. the bound
, since it does not enforce a uniform distribution of column norms, which makes the sparse representation more adaptive. As pointed out in
[25], using a Frobenius-norm bound results in a weighted sparse representation problem (at the inner level) where some coefficients can have more priority than others in taking nonzero values. Additionally, having bounded column norms is a stronger constraint, which makes the analysis more difficult when the dictionary is treated in its vectorized format (this becomes clearer in Section III). The typical method for bounding the dictionary is to project the updated dictionary (at the end of each iteration) back inside the constraint set. More details are provided in Section IV, where we describe the DL algorithm.

II-B. Dictionary learning from compressive measurements
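The scale ambiguity and the Frobenius projection can both be illustrated in a few lines (a sketch; the variable names and sizes are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))    # dictionary
x = rng.standard_normal(16)         # coefficient vector
c = 10.0
# Rescaling (D, x) -> (c*D, x/c) keeps the data-fit term unchanged...
assert np.allclose(D @ x, (c * D) @ (x / c))
# ...while shrinking the l1 penalty, so the unconstrained problem is degenerate:
assert np.linalg.norm(x / c, 1) < np.linalg.norm(x, 1)

def project_frobenius(D, bound=1.0):
    """Project D onto the ball {D : ||D||_F <= bound} by rescaling if needed."""
    nrm = np.linalg.norm(D, 'fro')
    return D if nrm <= bound else D * (bound / nrm)
```

The projection is a single rescaling because the Frobenius ball is rotationally symmetric around the origin.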
The problem of CS is to recover a sparse signal , or a signal that can be approximated by a sparse vector, from a set of linear measurements . When , the linear system is underdetermined and the solution set is infinite. However, it is not difficult to show that for a sufficiently sparse the solution to is unique [26]. Unfortunately, the problem of searching for the sparsest subject to is NP-hard and impractical to solve for high-dimensional . Meanwhile, CS theory indicates that this problem can be solved in polynomial time, using sparsity-promoting solvers such as Lasso [14], provided that satisfies the Restricted Isometry Property (RIP) [20].
CS also applies to a dense when it has a sparse representation of the form (with a sparse ). The measurements can be expressed as , where is called the projection matrix. It has been shown that most random designs of satisfy the RIP with high probability [27]. The compressive imaging problem can be expressed as:
(2) 
The wellknown basis pursuit signal recovery corresponds to the following asymptotic solution [31]:
(3) 
Hereafter, we focus on the block-CS framework, where each represents an image block and represents the vector of compressive measurements for that block. The iterative DL procedure based on block-CS measurements can be written as:
(4) 
with
To distinguish (4) from the standard DL in (1) and other BCS formulations [6, 7, 8], we refer to this problem as Dictionary Learning from linear Measurements, or simply DLM.
We could also arrange the block-wise measurements into a single system of linear equations:
(5) 
where
(6) 
represents the block-diagonal measurement matrix. Our results can be easily extended to dense-CS, i.e. CS with a dense . Although dense-CS does not allow the sequential processing of blocks required by an online-DL framework, a batch-DL framework is compatible with dense-CS.
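The block-diagonal matrix of (6) need never be materialized in practice. A small sketch (our notation) shows the equivalence between blockwise measurement and multiplication by the assembled matrix:

```python
import numpy as np

def block_diag_measure(A_list, x_blocks):
    """Apply the block-diagonal measurement of (6) without building it:
    y_i = A_i @ x_i for each block independently."""
    return [A @ x for A, x in zip(A_list, x_blocks)]

def assemble_block_diag(A_list):
    """Explicitly build Phi = diag(A_1, ..., A_N) (small examples only)."""
    m, n = A_list[0].shape
    N = len(A_list)
    Phi = np.zeros((N * m, N * n))
    for i, A in enumerate(A_list):
        Phi[i*m:(i+1)*m, i*n:(i+1)*n] = A
    return Phi
```

Stacking the per-block outputs reproduces the product with the full block-diagonal matrix, which is why blockwise processing loses nothing.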
III. Mathematical Analysis
The benefits of using a distinct for each block can be understood intuitively [9]. However, it is important to study the asymptotic behavior of DL when , as well as the non-asymptotic bounds for a finite .
In the first part of this section, we prove that the iterative DLM algorithm returns a unique solution with a probability that approaches one for large . Specifically, we show that the outer problem, known as the ‘dictionary update’ stage,
is unique for fixed ’s and also every inner problem
is unique for a fixed . Therefore, starting from an initial point , the sequence of DLM iterations forms a unique path.
To specify the accuracy of the DLM algorithm, we measure the expectation
(7) 
starting from the same (fixed) 's. Meanwhile, the inner problem is precisely a noisy CS problem and its accuracy has been thoroughly studied [1]. Specifically, when , where denotes the sparsity of , the inner CS problem for block can be solved exactly. The presented error analysis for a finite is limited to a single iteration of dictionary update. Nevertheless, as we show, the asymptotic conclusion is that the DLM and DL solutions converge as approaches infinity.
Based on the above remarks, our analysis is focused on a single iteration of DLM. Therefore, for simplicity, we drop the iteration superscript in the rest of this section unless required.
First, we write in the standard quadratic format:
We can further write:
and
Letting and , the standard quadratic form of can be written as:
(8) 
with
Next, we shall specify the stochastic construction of block-CS measurements, which we term the BIG measurement scheme.
Definition 1.
(BIG measurement) In a Block-based Independent Gaussian or BIG measurement scheme, each entry of each block measurement matrix
is independently drawn from a zero-mean Gaussian distribution with variance
. The variance guarantees that
. Note that although our analysis focuses on Gaussian measurements, it is straightforward to extend it to the larger class of sub-Gaussian measurements, which includes the Rademacher distribution and the general class of centered bounded random variables
[27].

III-A. Uniqueness
Before presenting the uniqueness results, we review the matrix extension of the Chernoff inequality [28], summarized in the following lemma.
Lemma 2.
(Matrix Chernoff, Theorem 5.1.1 in [28]). Consider a finite sequence $\{X_k\}$ of independent, random, positive semidefinite Hermitian matrices of dimension $d$ that satisfy $\lambda_{\max}(X_k) \le L$. Define the random matrix $Y = \sum_k X_k$. Compute the expectation parameters $\mu_{\min} = \lambda_{\min}(\mathbb{E}[Y])$ and $\mu_{\max} = \lambda_{\max}(\mathbb{E}[Y])$. Then,

$\mathbb{E}[\lambda_{\min}(Y)] \ge (1 - e^{-1})\,\mu_{\min} - L \log d$ (9)

and

$\mathbb{E}[\lambda_{\max}(Y)] \le (e - 1)\,\mu_{\max} + L \log d$. (10)

Furthermore,

$\Pr\{\lambda_{\max}(Y) \ge (1+\epsilon)\,\mu_{\max}\} \le d\,\big[e^{\epsilon}/(1+\epsilon)^{1+\epsilon}\big]^{\mu_{\max}/L}$ (11)

for $\epsilon \ge 0$, and for $\epsilon \in [0,1)$,

$\Pr\{\lambda_{\min}(Y) \le (1-\epsilon)\,\mu_{\min}\} \le d\,\big[e^{-\epsilon}/(1-\epsilon)^{1-\epsilon}\big]^{\mu_{\min}/L}$. (12)
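A quick numerical illustration of how Lemma 2 will be used (the sizes are hypothetical): each summand below plays the role of one block's contribution to the Hessian, is positive semidefinite with expectation equal to the identity under a BIG-style construction, and the smallest eigenvalue of the running sum grows roughly linearly in the number of summands, so the sum becomes and stays invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 8, 200

def psd_summand():
    # A has i.i.d. N(0, 1/m) entries, so E[A.T @ A] = I_n (a PSD summand).
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    return A.T @ A

H = np.zeros((n, n))
min_eigs = []
for _ in range(N):
    H += psd_summand()
    min_eigs.append(np.linalg.eigvalsh(H)[0])   # track the smallest eigenvalue
```

By Weyl's inequality, adding a PSD matrix can never decrease the smallest eigenvalue, and concentration keeps it near the linear trend.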
This lemma will be used to show that, with a high probability, the Hessian matrix of , which is a sum of random independent matrices, is full rank and invertible.
In the following theorem, let
denote the lower bound of the smallest eigenvalue of the covariance matrix
. Note that the covariance matrix must be full rank, or equivalently , otherwise even the original dictionary learning problem (based on the complete data) would not result in a unique solution. On top of that, the magnitude of has a direct impact on the condition number of the Hessian matrix and the numerical stability of DLM, as well as DL.

Theorem 3.
In a BIG measurement scheme with , has a unique minimum with a high probability.
Proof.
Taking the derivative of with respect to and setting it equal to zero results in the linear equation . Thus, to prove that the solution is unique, we must show that the Hessian matrix is invertible (with high probability) for large . Equivalently, we must show that the probability is close to one when . Since is a sum of independent matrices, we may use the matrix Chernoff inequality from Lemma 2. Hence, we must compute the following quantities:
and
Using properties of the Kronecker product,
where the inequality holds with probability for a random Gaussian measurement matrix [27]. Suppose the energy of every is bounded by some constant given that has bounded energy. Therefore, with probability , . Roughly speaking, assuming that pixel intensities are in the range , .
Again, using properties of the Kronecker product,
(13) 
Requiring , and that , is equivalent to . ∎
We have established that the upper-level problem results in a unique solution with high probability^6. To complete this subsection, we use the following result from [29], which implies that the lower-level problem is also unique.

^6 Clearly, projecting the resultant dictionary onto the space of matrices with a constant Frobenius norm preserves uniqueness, since there is only a single point on the sphere of constant-norm matrices that is closest to the current dictionary.
Lemma 4.
[29] If the entries of
are drawn from a continuous probability distribution on
, then for any and the Lasso solution is unique with probability one.

Since each is drawn from a continuous probability distribution in the BIG measurement scheme, is also distributed continuously in the space, which establishes that is unique.
III-B. Accuracy
In this subsection, we shall compute stochastic upper bounds for the distance between the DLM solution and the DL solution for a single iteration of dictionary update and for fixed 's. The extension of these results to multiple iterations is left as future work. As before, we omit the iteration superscript in the following results for simplicity.
Let us define the corresponding standard quadratic form for the upperlevel DL problem:
(14) 
where
For BIG measurements, . Therefore, it is easy to verify that , and where it is assumed that the data and coefficients are fixed. This leads to the following lemma that points out the unbiasedness of the compressive objective function.
Lemma 5.
Proof.
∎
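The display of Lemma 5 is lost in this copy, but per the surrounding text its content is that the compressive objective is an unbiased estimator of the complete-data objective. The identity underneath is that E[||A v||^2] = ||v||^2 when A has i.i.d. N(0, 1/m) entries; a Monte Carlo sanity check (sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 16, 32, 20000
v = rng.standard_normal(n)          # stands in for a fixed residual x - D @ s
true_energy = v @ v

# Average ||A v||^2 over many independent BIG-style draws of A.
est = 0.0
for _ in range(trials):
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    w = A @ v
    est += w @ w
est /= trials
```

The empirical average converges to the true energy, which is the unbiasedness claim applied term by term to the objective.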
The following crucial lemma implies that the two objective functions and get arbitrarily close as the number of blocks () approaches infinity.
Lemma 6.
We have simplified and customized Theorem III.1 of [30] for our problem here. Specifically, the bounds in [30] are tighter but more difficult to interpret. The above lemma states that, with a constant (high) probability that depends on , the deviation is inversely proportional to when the signal energy is evenly distributed among blocks.
Hereafter, to simplify the notation, let
(16) 
and
(17) 
.
The following theorem provides upper bounds for the expectation . Suppose that, for a fixed , there exists a positive constant such that . Clearly, this is a reasonable assumption for large according to Theorem 3. More specifically, according to Lemma 2, grows linearly with (because grows linearly with ).
Theorem 7.
and converge as approaches infinity. Specifically,
Proof.
We start by writing the Taylor expansion of the quadratic function at :
Since ,
and we can write
Taking the expected value of both sides
(18) 
On the other hand, we have the following inequalities at the optimum points of and
(21) 
and
(22) 
It is easy to check that, by combining (19), (20), (21) and (22), we can arrive at the following inequality:
or equivalently,
(23) 
Taking the expected value, we get
(24) 
From Lemma 5 we know that
Therefore,
(25) 
Note that is inversely proportional to and grows linearly with . Furthermore, using (12) for a constant probability, it can be shown that increases linearly with , making the ratio arbitrarily small for .
Finally, we show that after projection onto , where is a positive constant, the upper bound of the estimation error would still approach zero for large . By noticing that , this projection can be written as
It is easy to show that
Using and , lower bounds for and can be computed as
Therefore
Using Lemma 2 and other well-established concentration inequalities, one can find stochastic upper bounds for the quantities above. However, given that , , and scale linearly with , we can safely conclude that the estimation error remains bounded by an arbitrarily small number as approaches infinity. Moreover, intuitively speaking, the norms of and tend to increase before projection (which is the reason for bounding the dictionary in the first place), and the ratios and are likely to be smaller than one, resulting in a decrease in the estimation error.
IV. The Main Algorithm
The employed algorithm for DLM is based on (1). However, as explained below, we introduce several modifications to decrease the computational complexity and speed up the convergence. Similar to other DL algorithms, such as [4], the proposed algorithm consists of two stages that are called the sparse coding stage and the dictionary update stage. We describe them individually in the following subsections.
IV-A. The sparse coding stage
As mentioned in Section II-B, the basis pursuit (exact) CS recovery is the limit of the Lasso solution as approaches zero [31]. However, a truly sparse and exact representation is usually not possible with any finite-size dictionary. As a result, in sparse recovery of natural images, is usually set to a small number rather than zero, even in noiseless scenarios [17, 16]. Our algorithm starts from a coarse and overly sparse representation, by selecting the initial to be large, and gradually reduces until the desired balance between the residual sum of squares and the sparsity is achieved. The idea behind this modification is that the initial dictionary is suboptimal and not capable of giving an exact sparse representation. As the iterations pass, however, the dictionary becomes closer to the optimal dictionary, and must be decreased to obtain a sparse representation that closely adheres to the measurements.
Initializing the counter at and starting from an initial dictionary , the sparse coding stage consists of performing the following optimization:
(27) 
We deploy an exponential decay for :
(28) 
According to (28), is decreased from to over iterations and stays fixed at thereafter. For an exact recovery, is seemingly a plausible choice. However, for the reasons mentioned earlier, we set to a very small but nonzero value that is specified in the simulations section.
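The display of (28) is not preserved above. The following sketch implements one geometric interpolation consistent with the description (decay from an initial value to a floor over a fixed number of iterations, constant afterwards); the exact form and constants are our assumption, not necessarily the paper's:

```python
def lambda_schedule(lam0, lam_min, T, t):
    """Exponential decay from lam0 (at t = 0) to lam_min (at t = T),
    then constant at lam_min for t >= T."""
    if t >= T:
        return lam_min
    return lam0 * (lam_min / lam0) ** (t / T)
```

For example, with lam0 = 1.0, lam_min = 0.01 and T = 10, the midpoint t = 5 gives the geometric mean 0.1.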
IV-B. The dictionary update stage
The quadratic optimization problem of (16) can be computationally inefficient to solve. More specifically, solving (16) in one step requires computing the inverse of (if it exists), which has roughly a time complexity of and is a memory-intensive operation. The strategy that we employ in this paper, similar to what was proposed in [3, 24, 18], is to perform a gradient descent step:
(29) 
where can be computed efficiently:
(30) 
The step size can be iteratively decreased with [18], or it can be optimized in a steepest-descent fashion that is described below^7.

^7 If is well-conditioned, a single step of steepest descent can give a close approximation of the solution of (16).
The optimal value of the step size can be computed using a simple line search [19]. However, for a quadratic objective function, we can derive in a closed form as shown below. Let . Then,
Writing the optimality conditions for the objective function above, we can arrive at the following solution for :
(31) 
Since the initial dictionary consists of unit-norm columns, . As discussed in Section II, DL results in an unbounded dictionary if no constraint is put on the dictionary norm or the norms of its columns. Here, we employ a Frobenius bound on because it lets different dictionary columns have distinct norms. Specifically, after each update, we ensure the constraint . This is done by multiplying with . Algorithm 1 summarizes these steps.
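Putting the pieces of this subsection together, here is a sketch of one dictionary-update step: a steepest-descent step with the closed-form optimal step size for a quadratic objective, followed by the Frobenius-ball projection. For readability the sketch uses the complete-data objective 0.5*||Y - D X||_F^2; the paper's update uses the compressive objective instead, and all names and sizes are ours:

```python
import numpy as np

def dictionary_update_step(D, Y, X, c=1.0):
    """One steepest-descent step on f(D) = 0.5 * ||Y - D @ X||_F**2 with the
    closed-form optimal step size, then projection onto {D : ||D||_F <= c}."""
    G = (D @ X - Y) @ X.T                 # gradient of f with respect to D
    GX = G @ X
    denom = np.sum(GX * GX)
    if denom > 0:
        eta = np.sum(G * G) / denom       # optimal step: argmin over eta of f(D - eta*G)
        D = D - eta * G
    nrm = np.linalg.norm(D, 'fro')
    if nrm > c:
        D = D * (c / nrm)                 # scale back inside the constraint set
    return D
```

For a quadratic objective, exact line minimization along the negative gradient never increases the objective, which is the property the closed-form step exploits.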