Welcome to ShortScience.org! 
[link]
# Deep Convolutional Generative Adversarial Nets

## Introduction

* The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN), a variant of GAN whose convolutional generator and discriminator follow a set of architectural constraints.
* [Link to the paper](https://arxiv.org/abs/1511.06434)

## Benefits

* Stable to train.
* Very useful for learning unsupervised image representations.

## Model

* GANs are difficult to scale using CNNs.
* The paper proposes the following changes to GANs:
    * Replace any pooling layers with strided convolutions (in the discriminator) and fractional-strided convolutions (in the generator).
    * Remove fully connected hidden layers.
    * Use batch normalisation in both the generator (all layers except the output layer) and the discriminator (all layers except the input layer).
    * Use LeakyReLU in all layers of the discriminator.
    * Use ReLU activation in all layers of the generator (except the output layer, which uses Tanh).

## Datasets

* Large-Scale Scene Understanding.
* Imagenet-1K.
* Faces dataset.

## Hyperparameters

* Minibatch SGD with a minibatch size of 128.
* Weights initialized from a zero-centered Normal distribution with standard deviation = 0.02.
* Adam optimizer.
* Slope of leak = 0.2 for LeakyReLU.
* Learning rate = 0.0002, β1 = 0.5.

## Observations

* Large-Scale Scene Understanding data
    * Demonstrates that the model scales with more data and higher-resolution generation.
    * The model is unlikely to have memorized images, given the low learning rate and minibatch SGD.
* Classifying the CIFAR-10 dataset
    * Features
        * Train on Imagenet-1K and test on CIFAR-10.
        * Max pool the discriminator's convolutional features (from all layers) to get 4x4 spatial grids.
        * Flatten and concatenate to get a 28672-dimensional vector.
        * Train a linear L2-SVM classifier over the feature vector.
    * 82.8% accuracy, outperforming K-means (80.6%).
* Street View House Numbers classifier
    * Similar pipeline to CIFAR-10.
    * 22.48% test error.
* The paper contains many examples of images generated by final and intermediate layers of the network.
* Walking through the latent space does not produce sharp transitions between images, indicating that the network did not memorize images.
* DCGAN can learn an interesting hierarchy of features.
* The network seems to have some success in disentangling scene representation from object representation.
* Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually.
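To make the "strided instead of pooling" idea above concrete, here is a minimal sketch of how spatial resolution changes through a stack of strided and fractional-strided convolutions. The kernel size 4, stride 2, padding 1, the 4x4 starting grid and the 64x64 target are assumptions chosen for illustration; the summary does not state them. The `leaky_relu` helper uses the slope of 0.2 listed under the hyperparameters.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial size after a strided convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding):
    """Spatial size after a fractional-strided (transposed) convolution,
    assuming no dilation and no output padding."""
    return (size - 1) * stride - 2 * padding + kernel


def leaky_relu(x, slope=0.2):
    """LeakyReLU on a scalar; slope 0.2 is the value from the summary."""
    return x if x > 0 else slope * x


# Assumed layer settings (not given in the summary).
KERNEL, STRIDE, PADDING = 4, 2, 1


def trace_sizes(start, n_layers, layer_fn):
    """Apply the same layer geometry n_layers times and record each size."""
    sizes = [start]
    for _ in range(n_layers):
        sizes.append(layer_fn(sizes[-1], KERNEL, STRIDE, PADDING))
    return sizes


# Generator: each fractional-strided convolution doubles the resolution.
generator_sizes = trace_sizes(4, 4, conv_transpose_out)
# Discriminator: each strided convolution halves it, with no pooling layer.
discriminator_sizes = trace_sizes(64, 4, conv_out)

print("generator:", generator_sizes)
print("discriminator:", discriminator_sizes)
```

With these assumed settings every layer changes the resolution by exactly a factor of two, which is why strided convolutions can take over the role of pooling and upsampling layers.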
[link]
Everyone has been thinking about how to apply GANs to discrete sequence data for the past year or so. This paper presents the model that I would guess most people thought of as the first-thing-to-try:

1. Build a recurrent generator model which samples from its softmax outputs at each timestep.
2. Pass sampled sequences to a recurrent discriminator model which distinguishes between sampled sequences and real-data sequences.
3. Train the discriminator under the standard GAN loss.
4. Train the generator with a REINFORCE (policy gradient) objective, where each trajectory is assigned a single episodic reward: the score assigned to the generated sequence by the discriminator.

Sounds hacky, right? We're learning a generator with a high-variance model-free reinforcement learning algorithm, in a very seriously nonstationary environment. (Here the "environment" is a discriminator being jointly learned with the generator.)

There's just one trick in this paper on top of that setup: for non-terminal states, the reward is defined as the *expectation* of the discriminator score after stochastically generating from that state forward. To restate using standard (somewhat sloppy) RL syntax, in different terms than the paper: (under stochastic sequential policy $\pi$, with current state $s_t$, trajectory $\tau_{1:T}$ and discriminator $D(\tau)$)

$$r_t = \mathbb E_{\tau_{t+1:T} \sim \pi(s_t)} \left[ D(\tau_{1:T}) \right]$$

The rewards are estimated via Monte Carlo, i.e., just take the mean of $N$ rollouts from each intermediate state. They claim this helps to reduce variance. That makes intuitive sense, but I don't see any results in the paper demonstrating the effect of varying $N$.

Yep, so it turns out that this sort of works... with a big caveat:

## The big caveat

Graph from appendix:

![](https://www.dropbox.com/s/5fqh6my63sgv5y4/Bildschirmfoto%2020160927%20um%2021.34.44.png?raw=1)

SeqGANs don't work without supervised pretraining. Makes sense: with a cold start, the generator just samples a bunch of nonsense and the discriminator overfits. Both the generator and discriminator are pretrained on supervised data in this paper (see Algorithm 1).

I think it must be possible to overcome this with the proper training tricks and enough sweat. But it's probably more worth our time to address the fundamental problem here of developing better RL for structured prediction tasks.
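The rollout reward $r_t$ above can be sketched in a few lines. This is a toy illustration rather than the paper's implementation: `uniform_policy` and `toy_discriminator` are hypothetical stand-ins for the recurrent generator and discriminator, and the binary vocabulary is an assumption made to keep the example small.

```python
import random


def rollout_reward(prefix, seq_len, policy, discriminator, n_rollouts=16, rng=None):
    """Monte Carlo estimate of r_t: the mean discriminator score over
    n_rollouts completions of `prefix` sampled from `policy`."""
    if rng is None:
        rng = random.Random(0)
    # Terminal state: the sequence is complete, so score it directly.
    if len(prefix) >= seq_len:
        return discriminator(list(prefix))
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:
            seq.append(policy(seq, rng))
        total += discriminator(seq)
    return total / n_rollouts


# Hypothetical stand-ins, for illustration only.
VOCAB = [0, 1]


def uniform_policy(seq, rng):
    """Toy generator: ignores the prefix and samples a token uniformly."""
    return rng.choice(VOCAB)


def toy_discriminator(seq):
    """Toy discriminator: scores a sequence by its fraction of 1 tokens."""
    return sum(seq) / len(seq)


reward = rollout_reward([1, 1], 4, uniform_policy, toy_discriminator,
                        n_rollouts=200, rng=random.Random(1))
print("estimated reward for prefix [1, 1]:", reward)
```

Increasing `n_rollouts` lowers the variance of the estimate, at the cost of `n_rollouts` extra generator and discriminator passes per intermediate state.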

[link]
This paper presents an end-to-end version of memory networks (Weston et al., 2015): the model no longer trains on the strong 'supporting facts' supervision that marks which input sentences are the best memory accesses, which makes it much more realistic. The model also performs multiple hops (computational steps) per output symbol. The tasks are Q&A and language modeling, and the model achieves strong results on both. The paper is a useful extension of MemNN because it removes the strong, unrealistic supervision requirement and still performs pretty competitively. The architecture is defined cleanly and simply. The related work section is quite well-written, detailing the similarities and differences with multiple streams of related work. The discussion about the model's connection to RNNs is also useful.
[link]
[code](https://github.com/openai/improvedgan), [demo](http://infinitechamber35121.herokuapp.com/cifarminibatch/1/?), [related](http://www.inference.vc/understandingminibatchdiscriminationingans/)

### Feature matching

* problem: overtraining on the current discriminator
* solution: train the generator to minimize $\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{z \sim p_{z}(z)} f(G(z)) \|_{2}^{2}$, where $f(x)$ denotes the activations of an intermediate layer of the discriminator

### Minibatch discrimination

* problem: the generator collapses to a single point
* solution: for each sample $i$, concatenate to $f(x_i)$ features $b$ measuring its distance to the other samples $j$ ($i$ and $j$ are both real or both generated samples in the same batch): $\sum_j \exp(-\| M_{i, b} - M_{j, b} \|_{L_1})$
* this generates visually appealing samples very quickly

### Historical averaging

* problem: SGD fails by going into extended orbits
* solution: add a penalty that makes parameters revert to their historical mean: $\| \theta - \frac{1}{t} \sum_{i=1}^t \theta[i] \|^2$

### One-sided label smoothing

* problem: discriminator vulnerability to adversarial examples
* solution: the discriminator target for positive samples is 0.9 instead of 1

### Virtual batch normalization

* problem: using BN causes the output for each example to depend on the other examples in the batch
* solution: use a reference batch chosen once at the start of training; each sample is normalized using itself and the reference batch. It is expensive, so it is used only in the generator.

### Assessment of image quality

* problem: MTurk is not reliable
* solution: use the Inception model $p(y \mid x)$ to compute $\exp(\mathbb{E}_x \text{KL}(p(y \mid x) \,\|\, p(y)))$ on 50K generated images $x$

### Semi-supervised learning

Use the discriminator to also classify over the $K$ labels when they are known, and use all real samples (labeled and unlabeled) in the discrimination task: $D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case use feature matching but not minibatch discrimination. It also improves the quality of generated images.
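As a rough illustration of the image-quality score above, the sketch below computes $\exp(\mathbb{E}_x \text{KL}(p(y \mid x) \,\|\, p(y)))$ from a list of class-probability vectors. It assumes the vectors have already been produced by a classifier (the Inception model in the paper); the function name `inception_score` and the `eps` smoothing constant are choices made here, not taken from the released code.

```python
import math


def inception_score(probs, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y)), where p(y) is the average of the
    per-image class distributions in `probs` (a list of probability lists)."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y) over all generated images.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_total = 0.0
    for p in probs:
        for c, pc in enumerate(p):
            if pc > 0:  # terms with p(y|x) = 0 contribute nothing
                kl_total += pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
    return math.exp(kl_total / n)


# Every image gets the same uncertain prediction: the lowest possible score.
uncertain = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
# Confident predictions spread evenly over the classes: the highest score.
confident = [[1.0 if c == i else 0.0 for c in range(4)] for i in range(4)]

print("uncertain:", inception_score(uncertain))
print("confident and diverse:", inception_score(confident))
```

The score is high only when each image is classified confidently and the predicted classes are spread out over the whole set, so it rewards both sample quality and diversity.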

[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\ 4 & 5 & 1 \\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any nonnegativity constraint, using an SVD and reshaping the matrices by dropping all but the largest singular value:

$$ W = \left[\begin{array}{c} 0.656 \\ 0.652 \\ 0.379 \end{array}\right], H = \left[\begin{array}{c c c} 6.64 & 6.26 & 3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$ W = \left[\begin{array}{c} 0.388 \\ 0.386 \\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\ 4.33 & 4.08 & 2.09 \\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error then why is a nonnegativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively creating a nonnegative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that the error is not increased. The multiplicative approach is discussed in contrast to an additive, gradient-descent-based approach where small corrections are iteratively applied. The multiplicative rule can be recovered from the additive one by setting the learning rate ($\eta$) for each element of $H$ to the ratio of that element to the corresponding element of $W^T W H$.

### Still a draft
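The multiplicative update rule described above can be sketched in plain Python for the squared-error objective, applied to the 3x3 ratings example with $r = 1$. This is a minimal sketch, not the paper's code: the helper names, the random initialization range, the iteration count and the small `eps` added to the denominators are all assumptions made for illustration.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf_multiplicative(V, r, n_iter=200, seed=0, eps=1e-12):
    """Factor V (m x n, nonnegative) into W (m x r) and H (r x n) by
    repeatedly rescaling each element; returns W, H and the error history."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    # Strictly positive random start, so every element stays nonnegative.
    W = [[rng.uniform(0.1, 1.0) for _ in range(r)] for _ in range(m)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(r)]
    errors = [squared_error(V, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(m)]
        errors.append(squared_error(V, W, H))
    return W, H, errors


V = [[5, 4, 1], [4, 5, 1], [2, 1, 5]]
W, H, errors = nmf_multiplicative(V, r=1)
WH = matmul(W, H)
print([[round(x, 2) for x in row] for row in WH])
```

Only the product $WH$ is determined by the data; the individual scales of $W$ and $H$ depend on the initialization, so they will generally differ from the values shown in the example while the reconstruction agrees.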