From this paper, my understanding is that NMF is an unsupervised learning algorithm which decomposes the data matrix into two nonnegative sub-matrices. The final result is a compressed representation of the original data matrix. "Each data vector $v$ can be generated by a linear combination of the columns of $W$, weighted by the components of $h$. Therefore $W$ can be regarded as containing a basis that is optimized for the linear approximation of the data in $V$. Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data." Here I regard this method as generating new features for the current dataset; it could be used for dimension reduction like PCA, but with nonnegative results.
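To make the shapes concrete, here is a minimal numpy sketch of the factorization $V \approx WH$ (the dimensions and random data are my own toy choices, not from the paper):

```python
import numpy as np

# A minimal sketch of the factorization shapes (toy data, not the paper's experiments):
# V (n x m) holds m data vectors as columns, W (n x r) holds r basis vectors,
# H (r x m) holds coefficients, so each column v of V is approximated by W @ h.
rng = np.random.default_rng(0)
n, m, r = 100, 500, 10            # r is much smaller than n and m, so W and H compress V
V = rng.random((n, m))            # nonnegative toy data

W = rng.random((n, r))            # nonnegative basis (would be learned by NMF)
H = rng.random((r, m))            # nonnegative coefficients (would be learned by NMF)

v, h = V[:, 0], H[:, 0]           # one data vector and its coefficient vector
approx = W @ h                    # linear combination of the columns of W, weighted by h
print(V.shape, (W @ H).shape)     # both (100, 500); W @ H is the low-rank reconstruction
```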
To achieve the factorization, they proposed two cost functions. One is just the ordinary squared Euclidean distance between the original data matrix and the product of the two factor matrices. The other uses a (generalized) KL divergence for the same goal; a notable difference is that, unlike the Euclidean distance, the divergence is not symmetric in its two arguments.
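As a rough illustration (my own sketch; the small eps guard is an addition for numerical safety, not part of the paper), the two costs can be written as:

```python
import numpy as np

def euclidean_cost(V, W, H):
    """Squared Euclidean distance ||V - WH||^2 between the data and its reconstruction."""
    return np.sum((V - W @ H) ** 2)

def divergence_cost(V, W, H, eps=1e-12):
    """Generalized KL divergence D(V || WH); eps keeps the log finite (my addition)."""
    R = W @ H
    return np.sum(V * np.log((V + eps) / (R + eps)) - V + R)

# Asymmetry: divergence_cost(V, W, H) measures D(V || WH), which in general
# differs from D(WH || V), whereas the Euclidean cost is symmetric in V and WH.
```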
Despite the difference in cost functions, the update formulas are similar. The idea, in my understanding, comes from ordinary gradient descent. They connected the additive update rule $H_{a\mu} \leftarrow H_{a\mu}+\eta_{a\mu}\left[(W^TV)_{a\mu}-(W^TWH)_{a\mu}\right]$ with gradient descent, where $\eta_{a\mu}$ acts like a learning rate and the bracketed term is the negative gradient of the Euclidean cost with respect to $H$. Choosing the step size $\eta_{a\mu} = H_{a\mu}/(W^TWH)_{a\mu}$ turns this additive rule into the multiplicative update $H_{a\mu} \leftarrow H_{a\mu}\,(W^TV)_{a\mu}/(W^TWH)_{a\mu}$, which automatically keeps $H$ nonnegative.
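A minimal sketch of the resulting multiplicative updates for the Euclidean cost (the function name, iteration count, and eps guard are my own choices, not the paper's):

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for the Euclidean cost: the additive rule above with the
    step size eta_{a mu} = H_{a mu} / (W^T W H)_{a mu} folded in; eps avoids division
    by zero and is my own addition."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H stays nonnegative by construction
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # analogous update for W
    return W, H
```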
Finally, they utilized the concept of an auxiliary function $G$ and the update rule $h^{t+1} = \arg\min_h G(h, h^t)$ to guarantee that the cost function $F$ is nonincreasing under the update.
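The argument is short once the two defining properties of an auxiliary function, $G(h, h') \ge F(h)$ and $G(h, h) = F(h)$, are in place:

$$F(h^{t+1}) \le G(h^{t+1}, h^t) \le G(h^t, h^t) = F(h^t),$$

where the first inequality is the defining bound and the second holds because $h^{t+1}$ minimizes $G(\cdot, h^t)$.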