Algorithms for Non-negative Matrix Factorization on ShortScience.org

papers.nips.cc
scholar.google.com

Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 6

[link] Summary by Henry Z Lo 9 years ago

The paper introduces nonnegative matrix factorization, a technique which used in fields such as chemometrics.  The problem formulation is this:

$$
\underset{W,H}{\text{argmin}} ~ d(X, WH) \\\\
\text{s.t.} ~W_{ij}, H_{ij} \ge 0
$$

Where:
- $X \in \mathbb{R}^{n \times m}$ is a matrix of data, for example, $n$ samples of $m$ features.  Each element of $X$ is nonnegative, as are the elements of $W$ and $H$.
- $W \in \mathbb{R}^{n \times k}$ represents how each of the $n$ samples belong to each of the $k$ "clusters".
- $H \in \mathbb{R}^{k \times m}$ describes each of the $k$ clusters in terms of the $m$ variables.
- $d$ is some cost function, for example, sum of squared differences.

The non-negativity constraint means the clusters (represented by the rows $W$ of $W$) describe clusters in terms of what features are present.  This may make interpretation easier in some instances, but makes the optimization problem more difficult.

The paper mentions two loss functions, sum of squared error:

$$
d(X,WH) = \sum_{ij} |X_{ij}-(WH)_{ij}|^2
$$

 and a measure similar to unnormalized Kullback-Leibler divergence:
 
 $$
d(X,WH) = \sum_{ij} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij}+(WH)_{ij} \right)
$$

For each of these objectives, multiplicative update rules are given.  For squared error:

$$
H_{ij} \leftarrow H_{ij} \frac{(W^TX)_{ij}}{(W^TWH)}_{ij} ~~~
W_{ij} \leftarrow W_{ij} \frac{(XH^T)_{ij}}{(WHH^T)}_{ij}
$$

And for divergence:

$$
H_{ij} \leftarrow H_{ij} \frac{\sum_a W_{ai} X_{aj} / (WH)_{aj}} { \sum_b W_{bj}} ~~~~
W_{ij} \leftarrow W_{ij} \frac{\sum_a H_{ja} X_{ia} / (WH)_{ia}} { \sum_b H_{jb}}
$$

These rules are applied alternatingly; fix $W$ and update $H$, then fix $H$ and update $W$.

These multiplicative updates are essentially a diagonally rescaled gradient descent.  The authors then prove that these update rules do not increase the objective.  Future authors have pointed out that not increasing the cost does not imply convergence; e.g. the parameters could stop updating, without having reached a minima.  However, a trivial fix to the multiplicative update rules (ensuring no division by zero, by making 0 elements slightly positive) alleviates these problems.

Your comment: