Non-negative Matrix Factorizations

Here I present two applications of non-negative matrix factorizations: image analysis and textual analysis...

David Austin
Grand Valley State University

Introduction

Making sense of a large data set, one with perhaps millions of data points, is a common problem these days. We would like some way to reduce the data so that common features, trends, or correlations become apparent.

Mathematics provides many powerful tools for addressing this problem. In this column, we'll look at one approach that uses non-negative matrix factorizations, which were described by Lee and Seung in the late 1990s and which are finding increasing use in machine-learning applications.

To get a feel for the idea, let's begin with an application Lee and Seung considered in their original Nature paper. A well-known data set is a database of 2429 facial images, each of which consists of a 19$\times$19 array of grayscale pixels. Here are some typical images.

We will represent each of these images as a 19$\cdot$19 = 361-dimensional vector and form a matrix $V$ whose columns are the 2429 images. The dimensions of $V$ are therefore 361$\times$2429, and our goal is to find common features among the 2429 columns of $V$.
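
To make this concrete, here is a minimal sketch, in NumPy, of how the data matrix might be assembled, assuming the faces are available as an array `faces` of shape 2429$\times$19$\times$19; the array name and the random placeholder are illustrative, not part of the actual database.

```python
import numpy as np

# Placeholder: in practice, `faces` would be loaded from the face database.
faces = np.random.rand(2429, 19, 19)

# Flatten each 19x19 image into a 361-dimensional vector and use it as a column of V.
V = faces.reshape(2429, 361).T   # V has shape (361, 2429)
```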

Since the entries of $V$ are non-negative, a non-negative matrix factorization factors $V$ as $$ V = WH $$ where the entries in both $W$ and $H$ are non-negative and where $W$ has fewer columns than $V$. Lee and Seung chose $W$ to be a 361$\times$49 matrix so that $H$ is necessarily 49$\times$2429. The dimensions in the factorization may be represented schematically as:

In general, we will not be able to factor $V=WH$ exactly; the rank of $V$ is most likely close to 361, whereas the rank of $W$ can be no more than 49. Instead, we find the best approximation $$ V\approx WH $$ as will be explained later.

Each column of $W$ is a 361-dimensional vector and, since the entries are non-negative, we may view a column as a 19$\times$19 image. Here are the 49 images formed by the columns of $W$, brightened somewhat to increase the contrast. (Though these images are arranged in a rectangular grid, you should imagine them arranged horizontally as they represent the columns of $W$.)

To understand this better, let's look at a single image in the database, one corresponding to a column $V_j$. We see that this image can be written (approximately) as a linear combination of the images formed by the columns of $W$: $$ V_j \approx \sum_k W_kH_{kj}. $$ In other words, the image represented by $V_j$ is obtained by adding the images represented by $W_k$ with weights $H_{kj}$.
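
As a quick illustration, and assuming factors `W` (361$\times$49) and `H` (49$\times$2429) have already been computed (random placeholders stand in for them here), reconstructing a single image is just a matrix-vector product:

```python
import numpy as np

# Placeholders for factors that would come from the factorization.
W = np.random.rand(361, 49)
H = np.random.rand(49, 2429)

j = 0                          # index of the database image to reconstruct
reconstruction = W @ H[:, j]   # the columns of W weighted by the entries H[k, j]
image = reconstruction.reshape(19, 19)
```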

For example, the image on the left below is from the image database. The image on the right is that reconstructed from the matrix factorization.

Reconstructing this image uses 49 weights $H_{kj}$ that are visually represented below. Pure black represents a weight of 0 while lighter shades have larger weights. Each of the weights multiplies its corresponding column in $W$ represented in the grid of images above.

A few of the highest weights in the reconstruction are 0.70, 0.51, 0.49, 0.48, and 0.48; each multiplies one of the feature images formed by the columns of $W$, and the remaining, smaller weights contribute the rest.

The benefit of the matrix factorization is this: Because the entries in $W$ are non-negative, we may think of the columns as forming images. Because the entries in $H$ are non-negative, we reconstruct images in the database by adding together images represented by the columns of $W$ with various weights.

Consequently, we may think of the images represented by the columns of $W$ as common features that frequently appear in the database images. For instance, one of the feature images looks like it represents a lighter region corresponding to cheeks. When reconstructing a database image, the factorization decides how much of this feature to mix in. Other features look like parts of a nose or regions around the lips.

The matrix factorization detects these common features and provides a recipe for mixing them together to reconstruct the original data set.

In general, we begin with an $m\times n$ data matrix $V$ and factor $V\approx WH$ where $W$ is $m\times r$ and $H$ is $r\times n$. The factorization, in this case, identifies $r$ common features in the data set. But how should we choose $r$? For instance, why did Lee and Seung choose 49 for the number of columns of $W$?

There is really no clear-cut answer. One approach begins by noticing that we are creating a rank-$r$ approximation to the data matrix $V$, whose rank is typically $\min(m,n)$. We may examine the decreasing sequence of singular values of $V$ and choose $r$ at a point beyond which the singular values appear sufficiently small. We could also find non-negative factorizations for various values of $r$ and choose the one that seems most useful. In some applications, the nature of the problem leads to a natural choice for the number of features $r$.
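
For instance, one might inspect the singular values numerically. Here is a minimal sketch using NumPy; where to place the cutoff is a judgment call rather than a prescribed rule.

```python
import numpy as np

def inspect_singular_values(V, k=60):
    """Print the k leading singular values of V, in decreasing order."""
    singular_values = np.linalg.svd(V, compute_uv=False)
    print(singular_values[:k])
```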

Finding factorizations

Lee and Seung describe an ingenious method for forming non-negative matrix factorizations. We begin with an $m\times n$ data matrix $V$ and choose a rank $r$. Our goal is to approximately factor $V\approx WH$ where $W$ is an $m\times r$ matrix and $H$ is $r\times n$. We therefore seek matrices $W$ and $H$ that minimize a least-squares objective function $$ F(W,H) = \frac12\|{V - WH}\|^2. $$ It is important to remember that we have the additional constraint that the entries in $W$ and $H$ be non-negative: $W\geq 0$ and $H\geq 0$.
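
In code, the objective function is short; here is a sketch in NumPy, where the norm is taken to be the Frobenius norm.

```python
import numpy as np

def objective(V, W, H):
    """F(W, H) = 1/2 * ||V - W H||^2, using the Frobenius norm."""
    residual = V - W @ H
    return 0.5 * np.sum(residual ** 2)
```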

The first thing to note is that there is not a unique solution to this problem. If $W$ and $H$ minimize $F$ and $P$ is an invertible $r\times r$ matrix, then $WP$ and $P^{-1}H$ also minimize the objective since $(WP)(P^{-1}H) = WH$. For example, we may scale a column of $W$ to meet some desirable criterion so long as we appropriately scale the corresponding row of $H$.

Seeking a minimum $(W,H)$ of $F$, we could look for a critical point $\nabla F(W,H)=0$; solving this non-linear equation, however, is not feasible. Instead, our strategy for finding $W$ and $H$ is to randomly choose initial matrices $W^0\geq0$ and $H^0\geq0$ and create a sequence $(W^i,H^i)$ that continually decreases the objective function $F$. In particular, we will create an update rule $U(W,H)$ so that $$ (W^{i+1}, H^{i+1}) = U(W^i, H^i). $$ A conventional update strategy is to simply move in the direction of $-\nabla F(W, H)$ using the update rule $$ (W^{i+1}, H^{i+1}) = U(W^i, H^i) = (W^i, H^i) - \eta \nabla F(W^i,H^i) $$ for some appropriately chosen scalar $\eta$, usually called the learning rate in machine-learning applications. This process is known as gradient descent.
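
For comparison, here is a sketch of a single gradient descent step, using the gradients $\nabla_W F = -(V-WH)H^T$ and $\nabla_H F = -W^T(V-WH)$; the default learning rate `eta` is an arbitrary illustrative value.

```python
def gradient_descent_step(V, W, H, eta=1e-3):
    """One step of unconstrained gradient descent on F; entries may become negative."""
    residual = V - W @ H
    grad_W = -residual @ H.T    # gradient of F with respect to W
    grad_H = -W.T @ residual    # gradient of F with respect to H
    return W - eta * grad_W, H - eta * grad_H
```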

The problem with this approach is that the update rule may cause some of the entries in $W$ or $H$ to become negative, violating our constraints $W\geq0$ and $H\geq0$. To address this issue, Lee and Seung use a modified gradient descent algorithm, which we will describe. We will begin by fixing $W^i$ and updating $H^i$ so that $$ F(W^i,H^{i+1}) \leq F(W^i, H^i) $$ and then updating $W^i$ with $H^{i+1}$ fixed so that $$ F(W^{i+1}, H^{i+1}) \leq F(W^i,H^{i+1}) \leq F(W^i, H^i). $$ We will describe the first step, in which $H^i$ is updated, so we assume that $W=W^i$ is fixed.

We will see that the columns of $H$ do not interact with one another so it is enough to focus on a single column $h$ at a time. Therefore, we consider the objective function $$ F(h) = \frac12 \|v - Wh\|^2 $$ where $v$ and $h$ are corresponding columns of $V$ and $H$. If $h^i$ is a column of $H^i$, remembering that $\|x\|^2 = x^Tx$ shows us that $$ \begin{aligned} F(h) & = F(h^i + (h - h^i)) \\ & = \frac12 \|v - Wh^i - W(h-h^i)\|^2 \\ & = F(h^i) - (h-h^i)^TW^T(v-Wh^i) + \frac12 (h-h^i)^T(W^TW)(h-h^i) \\ \end{aligned} $$

An auxiliary function: Lee and Seung introduce an auxiliary function, $$ G_i(h) = F(h^i) - (h-h^i)^TW^T(v-Wh^i) + \frac12 (h-h^i)^TK(h^i)(h-h^i) $$ where $K(h^i)$ is the diagonal matrix: $$K_{aa} = \frac{(W^TWh^i)_a}{h^i_a}. $$ This function has two convenient properties. First, $G_i(h^i) = F(h^i)$. Second, noting the similarity of $G_i(h)$ and $F(h)$, we see that $G_i(h)-F(h)$ is the quadratic form: $$ G_i(h) - F(h) = \frac12(h-h^i)^T\left[K(h^i) - W^TW\right](h-h^i). $$ A straightforward calculation shows that $K(h^i) - W^TW$ is positive semi-definite so that $G_i(h)- F(h)\geq 0$ or $$ G_i(h) \geq F(h). $$ To summarize, our auxiliary function satisfies the properties: $$ \begin{aligned} G_i(h^i) & = F(h^i), \\ G_i(h) & \geq F(h). \end{aligned} $$

Minimizing the auxiliary function: The advantage of the auxiliary function is that we can find its minimum using the gradient: $$ \nabla G_i(h) = -W^T(v-Wh^i) + K(h^i)(h-h^i). $$ The update $h^{i+1}$ will be the critical point $\nabla G_i(h^{i+1}) = 0$. Remembering that $K$ is a diagonal matrix, we see that $\nabla G_i(h^{i+1}) = 0$ implies that $$ \begin{aligned} 0 & = -W^T(v-Wh^i) + K(h^i)(h^{i+1}-h^i) \\ h^{i+1} & = h^i + K(h^i)^{-1}\left[W^Tv - W^TWh^i\right] \\ \end{aligned} $$ Let's break this expression down into components: $$ \begin{aligned} h^{i+1}_a & = h^i_a + K(h^i)_{aa}^{-1}\left((W^Tv)_a - (W^TWh^i)_a\right) \\ & = h^i_a + \frac{h^i_a}{(W^TWh^i)_a}\left((W^Tv)_a - (W^TWh^i)_a\right) \\ & = \frac{(W^Tv)_a}{(W^TWh^i)_a}h^i_a \\ \end{aligned} $$ This gives the update rule: $$ h^{i+1}_a = \left(\frac{(W^Tv)_a}{(W^TWh^i)_a}\right) h^i_a. $$ Notice that this update rule is multiplicative: $h^i_a$ is multiplied by some factor and, because the entries in $W$, $H$, and $V$ are non-negative, this factor is non-negative. Our multiplicative update rule therefore respects the condition $H\geq0$.

The update doesn't increase the objective function: Of course, this update minimizes $G_i(h)$ so that $G_i(h^{i+1}) \leq G_i(h)$ for all other $h$. While this is not the same as our original goal of minimizing $F(h)$, we see that the update rule does not increase $F$. Remembering that $F(h) \leq G_i(h)$ for all $h$ and that $G_i(h^i) = F(h^i)$, we see that $$ F(h^{i+1}) \leq G_i(h^{i+1}) \leq G_i(h^i) = F(h^i). $$

Putting everything together: A similar argument leads to a multiplicative update rule for $W$ so that we have: $$ \begin{aligned} U^H(W, H)_{ab} & = \left(\frac{(W^TV)_{ab}} {(W^TWH)_{ab}}\right) H_{ab} \\ U^W(W, H)_{ab} & = \left(\frac{(VH^T)_{ab}} {(WHH^T)_{ab}}\right) W_{ab}. \\ \end{aligned} $$ As mentioned earlier, we update $(W^i, H^i)$ to $(W^{i+1}, H^{i+1})$ by first applying $$ H^{i+1} = U^H(W^i, H^i) $$ and then $$ W^{i+1} = U^W(W^i, H^{i+1}). $$
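
Here is a minimal sketch of the whole procedure with these multiplicative updates in NumPy; the iteration count, the random initialization, and the small constant `eps` (added only to avoid division by zero) are implementation choices made to keep the example self-contained.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-10, seed=0):
    """Approximately factor a non-negative m x n matrix V as W @ H
    using the multiplicative update rules described above."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # Update H with W fixed, then W with the new H fixed.
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # the update U^H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # the update U^W
    return W, H
```

For the face database, something like `W, H = nmf_multiplicative(V, r=49)` would produce the 361$\times$49 and 49$\times$2429 factors discussed above.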

One consequence of this algorithm is worth noting. Applying the update rules will decrease the objective function $F$ while respecting the constraints $W\geq 0$ and $H\geq 0$. Consequently, the sequence of matrices $(W^i, H^i)$ will often converge to a boundary point, meaning that some entries in $W$ and $H$ become zero. Due to the multiplicative nature of the update rules, once an entry becomes zero, it stays zero. Therefore, the final matrices $(W,H)$ typically have many zero entries.

This is apparent when looking at the images above and noting that large areas of the images defined by the columns of $W$ are black, which represents a zero entry. Likewise, many entries in the columns of $H$ are zero as well. This means that the images in the database are approximated by adding together only a subset of the feature images. Therefore, the images formed by the columns of $W$ typically represent a small feature. Every feature appears in some faces, but not every feature appears in every face.

Classifying Wikipedia pages

As another illustration of this technique, I looked at the 1000 most popular Wikipedia pages in 2018 and removed those not earning a "good" rating. ("Good" pages are those that meet certain editorial standards and exclude, say, the Wikipedia home page and pages about other web sites.) That left 563 pages. I extracted the text from each page and performed some basic cleaning by removing punctuation and applying a stemming algorithm so that related words such as "running", "ran", and "run" are considered equivalent. After removing a standard list of common words, such as "and", "the", and "also", there remained a set of 2345 words in the 563 pages.

For each page, I created a 2345-dimensional vector describing the frequency at which a given word appears in that page; that is, each component of the vector is the number of times a given word appears divided by the total number of words. This is known as a bag of words representation of the page since it ignores the placement of words near one another and the order in which they appear.
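
As a rough sketch (not the exact pipeline used here), the bag-of-words vector for one page might be computed like this, where `tokens` is the cleaned, stemmed word list of a page and `vocabulary` is the list of 2345 retained words; both names are placeholders.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Relative frequency of each vocabulary word within one page."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]
```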

I then put the pages together to form a $2345\times563$ data matrix $V$ and formed a non-negative matrix factorization $V\approx WH$ where $W$ is a $2345\times10$ matrix. Any of the 10 columns of $W$ can be interpreted as a common distribution of words appearing in the pages. We may therefore think of a column of $W$ as a "topic." Given one of the 563 pages, the factorization describes how the page is formed by mixing together various topics.
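
Rather than coding the update rules by hand, one could also use an off-the-shelf implementation. This sketch uses scikit-learn's `NMF` with `solver="mu"`, which applies multiplicative updates in the spirit of Lee and Seung; the parameter values and the random placeholder matrix are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

# A random non-negative placeholder stands in for the 2345 x 563 term-frequency matrix.
V = np.random.rand(2345, 563)

model = NMF(n_components=10, solver="mu", init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)   # 2345 x 10: each column is a topic's word distribution
H = model.components_        # 10 x 563: topic weights for each page
```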

For each of the columns of $W$, I made a word cloud, like the one below, illustrating the distribution of words. This particular topic tells us that one of the popular subjects for Wikipedia readers in 2018 was the Marvel universe.

The matrix $H$ also gives us useful information for classifying pages. Each column of $H$ corresponds to one of the 563 pages and each of the 10 entries in a column tells us how much a given topic contributes to the page. For a given page, we can find the largest component of that page's column of $H$ to determine which topic is the most important one in that page.
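
Continuing the sketch above, this amounts to taking an argmax down each column of $H$; the particular topic index below is hypothetical and would be identified by inspecting the word clouds.

```python
import numpy as np

# dominant_topic[j] is the index of the most important topic for page j.
dominant_topic = np.argmax(H, axis=0)

# Pages whose dominant topic is, say, the Marvel topic.
marvel_topic = 0   # hypothetical index, chosen by inspecting the word clouds
marvel_pages = np.where(dominant_topic == marvel_topic)[0]
```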

There are 33 pages for which the Marvel topic is the most important. These include pages for Avengers: Infinity War, Black Panther, Robert Downey Jr., and Iron Man 2.

Other topics are:

Geography, whose pages include China, Montreal, England, and Mumbai.

Movies, whose pages include Clint Eastwood, Saving Private Ryan, Emma Thompson, and Mila Kunis.

Music, whose pages include Kanye West, Amy Winehouse, Nina Simone, and Mick Jagger.

Health, whose pages include Benzodiazepine, Smallpox, HIV, and Plantar fasciitis.

Science and math, whose pages include Srinivasa Ramanujan, Python (programming language), Pythagorean theorem, and NASA. Regrettably, there are only 13 pages in this category though one of them is Taylor Series.

Sports, whose pages include Lionel Messi, LeBron James, Roger Federer, and New England Patriots.

The other topics could be called Government and Politics, History, and Celebrities and historical figures.

While the results seemed to be, in general, quite good, there were a few things that seemed not quite right. For instance, Mr. Bean and Star Trek ended up under the Health topic. A tenth topic had only two pages, Manhattan and New Zealand, and this situation persisted when I varied the number of topics.

The columns of $W$ form a 10-dimensional subspace of a 2345-dimensional vector space that, in some sense, best approximates the data. Given another Wikipedia page not in our original set, we can successively apply the update rules to describe that page as a linear combination of topics. This would tell us the important topics in that page and provide a means of classifying it. This illustrates how non-negative matrix factorizations appear in machine-learning applications.
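
A minimal sketch of this classification step: with $W$ fixed, iterate the multiplicative update for a single column $h$ until it settles down. The uniform initialization, iteration count, and small `eps` safeguard are choices made for this example.

```python
import numpy as np

def topic_weights(v, W, n_iter=200, eps=1e-10):
    """Non-negative weights h with v approximately equal to W h, found by
    iterating the multiplicative update for h while W stays fixed."""
    h = np.full(W.shape[1], 1.0 / W.shape[1])   # start with equal weights
    WtV = W.T @ v
    WtW = W.T @ W
    for _ in range(n_iter):
        h *= WtV / (WtW @ h + eps)
    return h
```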

Summary

We've looked at two applications of non-negative matrix factorizations here--image analysis and textual analysis--but there are many more. For instance, astronomers have applied this technique to the search for exoplanets. Due to its parts-based, additive nature, a non-negative matrix factorization facilitates the separation of starlight and light scattered by exoplanets. Additional areas of application include music analysis, cancer discovery, and community detection.

There are, of course, other methods for the kind of dimensionality reduction performed by non-negative matrix factorizations. Lee and Seung compare this technique to vector quantization and principal component analysis, both of which can be explained in terms of matrix factorizations $V=WH$.

In vector quantization, the columns of $H$ are required to be unary, meaning that there is a single nonzero entry, which is one. In this way, every column in $V$ is approximated by a single column of $W$. When studying the database of facial images, for example, every column of $W$ would itself be a representative face. The factorization has the effect of assigning one of these representatives to each facial image in the database.

Principal component analysis requires that the columns of $W$ be orthonormal and the rows of $H$ to be orthogonal to one another. The entries in $W$ and $H$ can be either positive or negative, and the columns of $W$ no longer have an intuitive meaning.

By contrast, non-negative matrix factorizations identify constituent features that appear as the columns of $W$ and describe how the data is assembled from these key features. As we saw when looking at the Wikipedia pages, the features that appear as the columns of $W$ often have intuitive meaning and the columns of $H$ naturally cluster the data.

In all of these techniques, we implicitly assume that a low-rank approximation of the data matrix $V$ is reasonable. Upon further reflection, however, this assumption may seem problematic. Choosing a matrix at random typically gives a full-rank matrix; in the space of matrices, those matrices whose rank is not full form a nowhere dense set. Why should we expect to find a good approximation to a data matrix in this small set?

A recent paper by Udell and Townsend examines this question and finds, assuming each column of the data matrix is associated to a latent variable, that there is a low-rank approximation in which every entry in the data matrix is approximated with a fixed absolute error. This result provides some justification for the widespread application of low-rank approximations.

References