[link]
# Main Results (tl;dr) ## Deep *Linear* Networks 1. Loss function is **nonconvex** and nonconcave 2. **Every local minimum is a global minimum** 3. Shallow neural networks *don't* have bad saddle points 4. Deep neural networks *do* have bad saddle points ## Deep *ReLU* Networks * Same results as above by reduction to deep linear networks under strong simplifying assumptions * Strong assumptions: * The probability that a path through the ReLU network is active is the same, agnostic to which path it is. * The activations of the network are independent of the input data and the weights. ## Highlighted Takeaways * Depth *doesn't* create nonglobal minima, but depth *does* create bad saddle points. * This paper moves deep linear networks closer to a good model for deep ReLU networks by discarding 5 of the 7 of the previously used assumptions. This gives more "support" for the conjecture that deep ReLU networks don't have bad local minima. * Deep linear networks don't have bad local minima, so if deep ReLU networks do have bad local minima, it's purely because of the introduction of nonlinear activations. This highlights the importance of the activation function used. * Shallow linear networks don't have bad saddles point while deep linear networks do, indicating that the saddle point problem is introduced with depth beyond the first hidden layer. Bad saddle point : saddle point whose Hessian has no negative eigenvalues (no direction to descend) Shallow neural network : single hidden layer Deep neural network : more than one hidden layer Bad local minima : local minima that aren't global minima # Position in Research Landscape * Conjecture from 1989: For deep linear networks, every local minimum is a global minimum: [Neural networks and principal component analysis: Learning from examples without local minima (Neural networks 1989)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.1839&rep=rep1&type=pdf) * This paper proves that conjecture. * Given 7 strong assumptions, the losses of local minima are concentrated in an exponentially (with dimension) tight band: [The Loss Surfaces of Multilayer Networks (AISTATS 2015)](https://arxiv.org/abs/1412.0233) * Discarding some of the above assumptions is an open problem: [Open Problem: The landscape of the loss surfaces of multilayer networks (COLT 2015)](http://proceedings.mlr.press/v40/Choromanska15.pdf) * This paper discards 5 of those assumptions and proves the result for a strictly more general deep nonlinear model class. # More Details ## Deep *Linear* Networks * Main result is Result 2, which proves the conjecture from 1989: every local minimum is a global minimum. * Not where the strong assumptions come in * Assumptions (realistic and practically easy to satisfy): * $XX^T$ and $XY^T$ are full rank * $d_y \leq d_x$ (output is lower dimension than input) * $\Sigma = YX^T(XX^T )^{−1}XY^T$ has $d_y$ distinct eigenvalues * specific to the squared error loss function * Essentially gives a comprehensive understanding of the loss surface of deep linear networks ## Deep ReLU Networks * Specific to ReLU activation. Makes strong use of its properties * Choromanska et al. (2015) relate the loss function to the Hamiltonian of the spherical spinglass model, using 3 reshaping assumptions. This allows them to apply existing random matrix theory results. This paper drops those reshaping assumptions by performing completely different analysis. * Because Choromanska et al. (2015) used random matrix theory, they analyzed a random Hessian, which means they need to make 2 distributional assumptions. This paper also drops those 2 assumptions and analyzes a deterministic Hessian. * Remaining Unrealistic Assumptions: * The probability that a path through the ReLU network is active is the same, agnostic to which path it is. * The activations of the network are independent of the input data and the weights. # Related Resources * [NIPS Oral Presentation](https://channel9.msdn.com/Events/NeuralInformationProcessingSystemsConference/NeuralInformationProcessingSystemsConferenceNIPS2016/DeepLearningwithoutPoorLocalMinima) 
[link]
* They present a variation of Faster RCNN, i.e. a model that predicts bounding boxes in images and classifies them. * In contrast to Faster RCNN, their model is fully convolutional. * In contrast to Faster RCNN, the computation per bounding box candidate (region proposal) is very low. ### How * The basic architecture is the same as in Faster RCNN: * A base network transforms an image to a feature map. Here they use ResNet101 to do that. * A region proposal network (RPN) uses the feature map to locate bounding box candidates ("region proposals") in the image. * A classifier uses the feature map and the bounding box candidates and classifies each one of them into `C+1` classes, where `C` is the number of object classes to spot (e.g. "person", "chair", "bottle", ...) and `1` is added for the background. * During that process, small subregions of the feature maps (those that match the bounding box candidates) must be extracted and converted to fixedsizes matrices. The method to do that is called "Region of Interest Pooling" (RoIPooling) and is based on max pooling. It is mostly the same as in Faster RCNN. * Visualization of the basic architecture: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__architecture.jpg?raw=true "Architecture") * Positionsensitive classification * Fully convolutional bounding box detectors tend to not work well. * The authors argue, that the problems come from the translationinvariance of convolutions, which is a desirable property in the case of classification but not when precise localization of objects is required. * They tackle that problem by generating multiple heatmaps per object class, each one being slightly shifted ("positionsensitive score maps"). * More precisely: * The classifier generates per object class `c` a total of `k*k` heatmaps. * In the simplest form `k` is equal to `1`. Then only one heatmap is generated, which signals whether a pixel is part of an object of class `c`. * They use `k=3*3`. The first of those heatmaps signals, whether a pixel is part of the *top left* corner of a bounding box of class `c`. The second heatmap signals, whether a pixel is part of the *top center* of a bounding box of class `c` (and so on). * The RoIPooling is applied to these heatmaps. * For `k=3*3`, each bounding box candidate is converted to `3*3` values. The first one resembles the top left corner of the bounding box candidate. Its value is generated by taking the average of the values in that area in the first heatmap. * Once the `3*3` values are generated, the final score of class `c` for that bounding box candidate is computed by averaging the values. * That process is repeated for all classes and a softmax is used to determine the final class. * The graphic below shows examples for that: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__examples.jpg?raw=true "Examples") * The above described RoIPooling uses only averages and hence is almost (computationally) free. * They make use of that during the training by sampling many candidates and only backpropagating on those with high losses (online hard example mining, OHEM). * À trous trick * In order to increase accuracy for small bounding boxes they use the à trous trick. * That means that they use a pretrained base network (here ResNet101), then remove a pooling layer and set the à trous rate (aka dilation) of all convolutions after the removed pooling layer to `2`. * The á trous rate describes the distance of sampling locations of a convolution. Usually that is `1` (sampled locations are right next to each other). If it is set to `2`, there is one value "skipped" between each pair of neighbouring sampling location. * By doing that, the convolutions still behave as if the pooling layer existed (and therefore their weights can be reused). At the same time, they work at an increased resolution, making them more capable of classifying small objects. (Runtime increases though.) * Training of RFCN happens similarly to Faster RCNN. ### Results * Similar accuracy as the most accurate Faster RCNN configurations at a lower runtime of roughly 170ms per image. * Switching to ResNet50 decreases accuracy by about 2 percentage points mAP (at faster runtime). Switching to ResNet152 seems to provide no measureable benefit. * OHEM improves mAP by roughly 2 percentage points. * À trous trick improves mAP by roughly 2 percentage points. * Training on `k=1` (one heatmap per class) results in a failure, i.e. a model that fails to predict bounding boxes. `k=7` is slightly more accurate than `k=3`.
1 Comments

[link]
* Usually GANs transform a noise vector `z` into images. `z` might be sampled from a normal or uniform distribution. * The effect of this is, that the components in `z` are deeply entangled. * Changing single components has hardly any influence on the generated images. One has to change multiple components to affect the image. * The components end up not being interpretable. Ideally one would like to have meaningful components, e.g. for human faces one that controls the hair length and a categorical one that controls the eye color. * They suggest a change to GANs based on Mutual Information, which leads to interpretable components. * E.g. for MNIST a component that controls the stroke thickness and a categorical component that controls the digit identity (1, 2, 3, ...). * These components are learned in a (mostly) unsupervised fashion. ### How * The latent code `c` * "Normal" GANs parameterize the generator as `G(z)`, i.e. G receives a noise vector and transforms it into an image. * This is changed to `G(z, c)`, i.e. G now receives a noise vector `z` and a latent code `c` and transforms both into an image. * `c` can contain multiple variables following different distributions, e.g. in MNIST a categorical variable for the digit identity and a gaussian one for the stroke thickness. * Mutual Information * If using a latent code via `G(z, c)`, nothing forces the generator to actually use `c`. It can easily ignore it and just deteriorate to `G(z)`. * To prevent that, they force G to generate images `x` in a way that `c` must be recoverable. So, if you have an image `x` you must be able to reliable tell which latent code `c` it has, which means that G must use `c` in a meaningful way. * This relationship can be expressed with mutual information, i.e. the mutual information between `x` and `c` must be high. * The mutual information between two variables X and Y is defined as `I(X; Y) = entropy(X)  entropy(XY) = entropy(Y)  entropy(YX)`. * If the mutual information between X and Y is high, then knowing Y helps you to decently predict the value of X (and the other way round). * If the mutual information between X and Y is low, then knowing Y doesn't tell you much about the value of X (and the other way round). * The new GAN loss becomes `old loss  lambda * I(G(z, c); c)`, i.e. the higher the mutual information, the lower the result of the loss function. * Variational Mutual Information Maximization * In order to minimize `I(G(z, c); c)`, one has to know the distribution `P(cx)` (from image to latent code), which however is unknown. * So instead they create `Q(cx)`, which is an approximation of `P(cx)`. * `I(G(z, c); c)` is then computed using a lower bound maximization, similar to the one in variational autoencoders (called "Variational Information Maximization", hence the name "InfoGAN"). * Basic equation: `LowerBoundOfMutualInformation(G, Q) = E[log Q(cx)] + H(c) <= I(G(z, c); c)` * `c` is the latent code. * `x` is the generated image. * `H(c)` is the entropy of the latent codes (constant throughout the optimization). * Optimization w.r.t. Q is done directly. * Optimization w.r.t. G is done via the reparameterization trick. * If `Q(cx)` approximates `P(cx)` *perfectly*, the lower bound becomes the mutual information ("the lower bound becomes tight"). * In practice, `Q(cx)` is implemented as a neural network. Both Q and D have to process the generated images, which means that they can share many convolutional layers, significantly reducing the extra cost of training Q. ### Results * MNIST * They use for `c` one categorical variable (10 values) and two continuous ones (uniform between 1 and +1). * InfoGAN learns to associate the categorical one with the digit identity and the continuous ones with rotation and width. * Applying Q(cx) to an image and then classifying only on the categorical variable (i.e. fully unsupervised) yields 95% accuracy. * Sampling new images with exaggerated continuous variables in the range `[2,+2]` yields sound images (i.e. the network generalizes well). * ![MNIST examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__mnist.png?raw=true "MNIST examples") * 3D face images * InfoGAN learns to represent the faces via pose, elevation, lighting. * They used five uniform variables for `c`. (So two of them apparently weren't associated with anything sensible? They are not mentioned.) * 3D chair images * InfoGAN learns to represent the chairs via identity (categorical) and rotation or width (apparently they did two experiments). * They used one categorical variable (four values) and one continuous variable (uniform `[1, +1]`). * SVHN * InfoGAN learns to represent lighting and to spot the center digit. * They used four categorical variables (10 values each) and two continuous variables (uniform `[1, +1]`). (Again, a few variables were apparently not associated with anything sensible?) * CelebA * InfoGAN learns to represent pose, presence of sunglasses (not perfectly), hair style and emotion (in the sense of "smiling or not smiling"). * They used 10 categorical variables (10 values each). (Again, a few variables were apparently not associated with anything sensible?) * ![CelebA examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__celeba.png?raw=true "CelebA examples") 
[link]
* They present a hierarchical method for reinforcement learning. * The method combines "long"term goals with shortterm action choices. ### How * They have two components: * MetaController: * Responsible for the "long"term goals. * Is trained to pick goals (based on the current state) that maximize (extrinsic) rewards, just like you would usually optimize to maximize rewards by picking good actions. * The MetaController only picks goals when the Controller terminates or achieved the goal. * Controller: * Receives the current state and the current goal. * Has to pick a reward maximizing action based on those, just as the agent would usually do (only the goal is added here). * The reward is intrinsic. It comes from the Critic. The Critic gives reward whenever the current goal is reached. * For Montezuma's Revenge: * A goal is to reach a specific object. * The goal is encoded via a bitmask (as big as the game screen). The mask contains 1s wherever the object is. * They handextract the location of a few specific objects. * So basically: * The MetaController picks the next object to reach via a Qvalue function. * It receives extrinsic reward when objects have been reached in a specific sequence. * The Controller picks actions that lead to reaching the object based on a Qvalue function. It iterates actionchoosing until it terminates or reached the goalobject. * The Critic awards intrinsic reward to the Controller whenever the goalobject was reached. * They use CNNs for the MetaController and the Controller, similar in architecture to the AtariDQN paper (shallow CNNs). * They use two replay memories, one for the MetaController (size 40k) and one for the Controller (size 1M). * Both follow an epsilongreedy policy (for picking goals/actions). Epsilon starts at 1.0 and is annealed down to 0.1. * They use a discount factor / gamma of 0.9. * They train with SGD. ### Results * Learns to play Montezuma's Revenge. * Learns to act well in a more abstract MDP with delayed rewards and where simple Qlearning failed.  # Rough chapterwise notes * (1) Introduction * Basic problem: Learn goal directed behaviour from sparse feedbacks. * Challenges: * Explore state space efficiently * Create multiple levels of spatiotemporal abstractions * Their method: Combines deep reinforcement learning with hierarchical value functions. * Their agent is motivated to solve specific intrinsic goals. * Goals are defined in the space of entities and relations, which constraints the search space. * They define their value function as V(s, g) where s is the state and g is a goal. * First, their agent learns to solve intrinsically generated goals. Then it learns to chain these goals together. * Their model has two hiearchy levels: * MetaController: Selects the current goal based on the current state. * Controller: Takes state s and goal g, then selects a good action based on s and g. The controller operates until g is achieved, then the metacontroller picks the next goal. * MetaController gets extrinsic rewards, controller gets intrinsic rewards. * They use SGD to optimize the whole system (with respect to reward maximization). * (3) Model * Basic setting: Action a out of all actions A, state s out of S, transition function T(s,a)>s', reward by state F(s)>R. * epsilongreedy is good for local exploration, but it's not good at exploring very different areas of the state space. * They use intrinsically motivated goals to better explore the state space. * Sequences of goals are arranged to maximize the received extrinsic reward. * The agent learns one policy per goal. * MetaController: Receives current state, chooses goal. * Controller: Receives current state and current goal, chooses action. Keeps choosing actions until goal is achieved or a terminal state is reached. Has the optimization target of maximizing cumulative reward. * Critic: Checks if current goal is achieved and if so provides intrinsic reward. * They use deep Q learning to train their model. * There are two Qvalue functions. One for the controller and one for the metacontroller. * Both formulas are extended by the last chosen goal g. * The Qvalue function of the metacontroller does not depend on the chosen action. * The Qvalue function of the controller receives only intrinsic direct reward, not extrinsic direct reward. * Both Qvalue functions are reprsented with DQNs. * Both are optimized to minimize MSE losses. * They use separate replay memories for the controller and metacontroller. * A memory is added for the metacontroller whenever the controller terminates. * Each new goal is picked by the metacontroller epsilongreedy (based on the current state). * The controller picks actions epsilongreedy (based on the current state and goal). * Both epsilons are annealed down. * (4) Experiments * (4.1) Discrete MDP with delayed rewards * Basic MDP setting, following roughly: Several states (s1 to s6) organized in a chain. The agent can move left or right. It gets high reward if it moves to state s6 and then back to s1, otherwise it gets small reward per reached state. * They use their hierarchical method, but without neural nets. * Baseline is Qlearning without a hierarchy/intrinsic rewards. * Their method performs significantly better than the baseline. * (4.2) ATARI game with delayed rewards * They play Montezuma's Revenge with their method, because that game has very delayed rewards. * They use CNNs for the controller and metacontroller (architecture similar to the AtariDQN paper). * The critic reacts to (entity1, relation, entity2) relationships. The entities are just objects visible in the game. The relation is (apparently ?) always "reached", i.e. whether object1 arrived at object2. * They extract the objects manually, i.e. assume the existance of a perfect unsupervised object detector. * They encode the goals apparently not as vectors, but instead just use a bitmask (game screen heightand width), which has 1s at the pixels that show the object. * Replay memory sizes: 1M for controller, 50k for metacontroller. * gamma=0.99 * They first only train the controller (i.e. metacontroller completely random) and only then train both jointly. * Their method successfully learns to perform actions which lead to rewards with long delays. * It starts with easier goals and then learns harder goals. 
[link]
* Weight Normalization (WN) is a normalization technique, similar to Batch Normalization (BN). * It normalizes each layer's weights. ### Differences to BN * WN normalizes based on each weight vector's orientation and magnitude. BN normalizes based on each weight's mean and variance in a batch. * WN works on each example on its own. BN works on whole batches. * WN is more deterministic than BN (due to not working an batches). * WN is better suited for noisy environment (RNNs, LSTMs, reinforcement learning, generative models). (Due to being more deterministic.) * WN is computationally simpler than BN. ### How its done * WN is a module added on top of a linear or convolutional layer. * If that layer's weights are `w` then WN learns two parameters `g` (scalar) and `v` (vector, identical dimension to `w`) so that `w = gv / v` is fullfilled (`v` = euclidean norm of v). * `g` is the magnitude of the weights, `v` are their orientation. * `v` is initialized to zero mean and a standard deviation of 0.05. * For networks without recursions (i.e. not RNN/LSTM/GRU): * Right after initialization, they feed a single batch through the network. * For each neuron/weight, they calculate the mean and standard deviation after the WN layer. * They then adjust the bias to `mean/stdDev` and `g` to `1/stdDev`. * That makes the network start with each feature being roughly zeromean and unitvariance. * The same method can also be applied to networks without WN. ### Results: * They define BNMEAN as a variant of BN which only normalizes to zeromean (not unitvariance). * CIFAR10 image classification (no data augmentation, some dropout, some white noise): * WN, BN, BNMEAN all learn similarly fast. Network without normalization learns slower, but catches up towards the end. * BN learns "more" per example, but is about 16% slower (timewise) than WN. * WN reaches about same test error as no normalization (both ~8.4%), BN achieves better results (~8.0%). * WN + BNMEAN achieves best results with 7.31%. * Optimizer: Adam * Convolutional VAE on MNIST and CIFAR10: * WN learns more per example und plateaus at better values than network without normalization. (BN was not tested.) * Optimizer: Adamax * DRAW on MNIST (heavy on LSTMs): * WN learns significantly more example than network without normalization. * Also ends up with better results. (Normal network might catch up though if run longer.) * Deep Reinforcement Learning (Space Invaders): * WN seemed to overall acquire a bit more reward per epoch than network without normalization. Variance (in acquired reward) however also grew. * Results not as clear as in DRAW. * Optimizer: Adamax ### Extensions * They argue that initializing `g` to `exp(cs)` (`c` constant, `s` learned) might be better, but they didn't get better test results with that. * Due to some gradient effects, `v` currently grows monotonically with every weight update. (Not necessarily when using optimizers that use separate learning rates per parameters.) * That grow effect leads the network to be more robust to different learning rates. * Setting a small hard limit/constraint for `v` can lead to better test set performance (parameter updates are larger, introducing more noise). ![CIFAR10 results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Weight_Normalization__cifar10.png?raw=true "CIFAR10 results") *Performance of WN on CIFAR10 compared to BN, BNMEAN and no normalization.* ![DRAW, DQN results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Weight_Normalization__draw_dqn.png?raw=true "DRAW, DQN results") *Performance of WN for DRAW (left) and deep reinforcement learning (right).* 
[link]
This paper proposed a class of loss functions applicable to image generation that are based on distance in feature spaces: $$\mathcal{L} = \lambda_{feat}\mathcal{L}_{feat} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{img}\mathcal{L}_{img}$$ ### Key Points  Using only l2 loss in image space yields oversmoothed results since it leads to averaging all likely locations of details.  L_feat measures the distance in suitable feature space and therefore preserves distribution of fine details instead of exact locations.  Using only L_feat yields bad results since feature representations are contractive. Many nonnatural images also mapped to the same feature vector.  By introducing a natural image prior  GAN, we can make sure that samples lie on the natural image manifold. ### Model https://i.imgur.com/qNzMwQ6.png ### Exp  Training Autoencoder  Generate images using VAE  Invert feature ### Thought I think the experiment section is a little complicated to comprehend. However, the proposed loss seems really promising and can be applied to many tasks related to image generation. ### Questions  Section 4.2 & 4.3 are hard to follow for me, need to pay more attention in the future 
[link]
This paper performs activation maximization (AM) using Deep Generator Network (DGN), which served as a learned natural iamge prior, to synthesize realistic images as inputs and feed it into the DNN we want to understand. By visualizing synthesized images that highly activate particular neurons in the DNN, we can interpret what each of neurons in the DNN learned to detect. ### Key Points  DGN (natural image prior) generates more coherent images when optimizing fullyconnected layer codes instead of lowlevel codes. However, previous studies showed that lowlevel features results in better reconstructions beacuse it contains more image details. The difference is that here DGNAM is trying to synthesize an entire layer code from scratch. Features in lowlevel only has a small, local receptive field so that the optimization process has to independently tune image without knowing the global structure. Also, the code space at a convolutional layer is much more highdimensional, making it harder to optimize.  The learned prior trained on ImageNet can also generalize to Places.  It doesn't generalize well if architecture of the encoder trained with DGN is different with the DNN we wish to inspect.  The learned prior also generalizes to visualize hidden neurons, producing more realistic textures/colors.  When visualizing hidden neurons, DGNAM trained on ImageNet also generalize to Places and produce similar results as [1].  The synthesized images are showed to teach us what neurons in DNN we wish to inspect prefer instead of what prior prefer. ### Model ![](https://cloud.githubusercontent.com/assets/7057863/21002626/b094d7aebd6111e68c95fd4931648426.png) ### Thought Solid paper with diverse visualizations and thorough analysis. ### Reference [1] Object Detectors Emerge In Deep Scene CNNs, B.Zhou et. al. 
[link]
This paper developed a semantically rich representation for natural sound using unlabeled videos as a bridge to transfer discriminative visual knowledge from wellestablished visual recognition models into the sound modality. The learned sound representation yields significant performance improvements on standard benchmarks for acoustic scene classification task. ### Key Points  The natural synchronization between vision and sound can be leveraged as a supervision signal for each other.  Crossmodal learning can overcome overfitting if the target modal have much fewer data than other modals, which is essential for deep networks to work well.  In the sound classification task, **pool5** and **conv6** extracted from SoundNet achieve best performance. ### Model  The authors proposed a studentteacher training procedure to transfer discriminative visual knowledge from visual recognition models trained on ImageNet and Places into the SoundNet by minimizing KL divergence between their predictions. ![](https://cloud.githubusercontent.com/assets/7057863/20856609/05fe12d6b94e11e68c92995ee84fe0d7.png)  Two reasons to use CNN for sound: 1. invariant to translations; 2. stacking layers to detect higherlevel concepts. ### Exp  Adding a linear SVM upon representation learned from SoundNet outperforms other existing methods 10%.  Using lots of unlabeled videos as supervision signals enable the deeper SoundNet to work, or otherwise the 8layer networks performs poorly due to overfitting.  Simultaneous Using Places and ImageNet as supervision beats using only one of them 3%.  Multimodal recognition models use visual and sound data together yields 2% gain in classification accuracy. ### Thought I think this paper is really complete since it contains good intuition, ablation analysis, representation visualization, hidden unit visualization, and significent performance imporvements. ### Questions  Although paper said that "To handle variabletemporallength of input sound, this model uses a fully convolutional network and produces an output over multiple timesteps in video.", but the code seems to set the length of each excerpts fixed to 5 seconds.  It looks not clear for me about the data augmentation technique used in training. 
[link]
The authors presented a new generative model that learns to disentangle the factors of variations of the data. The authors claim that the proposed model is pretty robust to supervision. This is achieved by combining two of the most successful generative models: VAE and GAN. The model is able to resolve the analogies in a consistent way on several datasets with minimal parameter/architecture tunning. This paper presents a way to learn latent codes for data, that captures both the information relevant for a given classification task, as well as the remaining irrelevant factors of variation (rather than discarding the latter as a classification model would). This is done by combining a VAEstyle generative model, and adversarial training. This model proves capable of disentangling style and content in images (without explicit supervision for style information), and proves useful for analogy resolution. This paper introduces a generative model for learning to disentangle hidden factors of variation. The disentangling separates the code into two, where one is claimed to be the code that descries factors relevant to solving a specific task, and the other describing the remaining factors. Experimental results show that the proposed method is promising. The authors combine state of the art methods VAE and GAN to generate images with two complementary codes: one relevant and one irrelevant. They major contribution of the paper is the development of a training procedure that exploits triplets of images (two sharing the relevant code, one note sharing) to regularize the encoderdecoder architecture and avoid trivial solutions. The results are qualitatively good and comparable to previous article using more sources of supervision. Paper seeks to explore the variations amongst samples which separate multiple classes using auto encoders and decoders. Specifically, the authors propose combining generative adversarial networks and variational auto encoders. The idea mimics the game play between two opponents, where one attempts to fool the other into believing a synthetic sample is in fact a natural sample. The paper proposes an iterative training procedure where a generative model was first trained on a number of samples while keeping the weights of the adversary constant and later the adversary is trained while keeping the generative model weights constant. The paper performs experiments on generation of instances between classes, retrieval of instances belonging to a given class, and interpolation of instances between two classes. The experiments were performed on MNIST, a set of 2D character animation sprites, and 2D NORB toy image dataset. 
[link]
The Authors provide a bag of tricks for training GAN's in the image domain. Using these, they achieve very strong semisupervised results on SHVN, MNIST, and CIFAR. The authors then train the improved model on several images datasets, evaluate it on different tasks: semisupervised learning, and generative capabilities, and achieve stateoftheart results. This paper investigates several techniques to stabilize GAN training and encourage convergence. Although lack of theoretical justification, the proposed heuristic techniques give betterlooking samples. In addition to human judgement, the paper proposes a new metric called Inception score by applying pretrained deep classification network on the generated samples. By introducing free labels with the generated samples as new category, the paper proposes the experiment using GAN under semisupervised learning setting, which achieve SOTA semisupervised performance on several benchmark datasets (MNIST, CIFAR10, and SVHN). 
[link]
The paper proposes a "neural transducer" model for sequencetosequence tasks that operates in a lefttoright and online fashion. In other words, the model produces output as the input is received instead of waiting until the full input is received like most sequencetosequence models do. Key ideas used to make the model work include a recurrent attention mechanism, the use of an endofblock symbol in the output alphabet to indicate when the transducer should move to the next input block, and approximate algorithms based on dynamic programming and beam search for training and inference with the transducer model. Experiments on the TIMIT speech task show that the model works well and explore some of the design parameters of the model. Like similar models of this type, the input is processed by an encoder and a decoder produces an output sequence using the information provided by the encoder and conditioned on its own previous predictions. The method is evaluated on a toy problem and the TIMIT phoneme recognition task. The authors also propose some smaller ideas like two different attention mechanism variations. The map from block input to output is governed by a standard sequencetosequence model with additional state carried over from the previous block. Alignment of the two sequences is approximated by a dynamic program using a greedy local search heuristic. Experimental results are presented for phone recognition on TIMIT. The encoder is a multilayer LSTM RNN. The decoder is an RNN model conditioned on weighted sums of the last layer of the encoder and it's previous output. The weighting schemes (attention) varies and can be conditioned on the hidden states or also previous attention vectors. The decoder model produces a sequence of symbols, until it outputs a special end character "e" and is moved to the next block (other mechanisms where explored as well (no endofblocksymbol and separately predicting the end of a block given the attention vector). It is then fed the weighted sum of the next block of encoder states. The resulting sequence of symbols determines an alignment of the target symbols over the blocks of inputs, where each block may be assigned a variable number of characters. The system is trained by fixing an alignment, that approximately resembles the best alignment. Finding this approximately best alignment is akin to a beamsearch with a beam size of M (line 169), but a restricted set of symbols conditional on the last symbol in a particular hypothesis (since the target sequence is known). Alignments are computed less frequently than model updates (typically every 100 to 300 sequences). For inference, an unconstrained beamsearch procedure is performed with a threshold on sequence length and beam size. 
[link]
Authors present a method similar to teacher forcing that uses generative adversarial networks to guide training on sequential tasks. This work describes a novel algorithm to ensure the dynamics of an LSTM during inference follows that during training. The motivating example is sampling for a long number of steps at test time while only training on shorter sequences at training time. Experimental results are shown on PTB language modelling, MNIST, handwriting generation and music synthesis. The paper is similar to Generative Adversarial Networks (GAN): in addition to a normal sequence model loss function, the parameters try to “fool” a classifier. That classifier is trying to distinguish generated sequences from the sequence model, from real data. A few Objectives are proposed in section 2.2. The key difference to GAN is the B in equations 14. B is a function outputs some statistics of the model, such as the hidden state of the RNN, whereas GAN tries rather to discriminate the actual output sequences. This paper proposes a method for training recurrent neural networks (RNN) in the framework of adversarial training. Since RNNs can be used to generate sequential data, the goal is to optimize the network parameters in such a way that the generated samples are hard to distinguish from real data. This is particularly interesting for RNNs as the classical training criterion only involves the prediction of the next symbol in the sequence. Given a sequence of symbols $x_1, ..., x_t$, the model is trained so as to output $y_t$ as close to $x_{t+1}$ as possible. Training that way does not provide models that are robust during generation, as a mistake at time t potentially makes the prediction at time $t+k$ totally unreliable. This idea is somewhat similar to the idea of computing a sentencewide loss in the context of encodedecoder translation models. The loss can only be computed after a complete sequence has been generated. 
[link]
The authors propose to replace the notion of 'attention' in neural architectures with the notion of 'active memory' where rather than focusing on a single part of the memory one would operate on the whole of it in parallel. This paper introduces an extension to neural GPUs for machine translation. I found the experimental analysis section lacking in both comparisons to state of the art MT techniques as well as thoroughly evaluating the proposed method. This paper proposes active memory, which is a memory mechanism that operates all the part in parallel. The active memory was compared to attention mechanism and it is shown that the active memory is more effective for long sentence translation than the attention mechanism in EnglishFrench translation. This paper proposes two new models for modeling sequential data in the sequencetosequence framework. The first is called the Markovian Neural GPU and the second is called the Extended Neural GPU. Both models are extensions of the Neural GPU model (Kaiser and Sutskever, 2016), but unlike the Neural GPU, the proposed models do not model the outputs independently but instead connect the output token distributions recursively. The paper provides empirical evidence on a machine translation task showing that the two proposed models perform better than the Neural GPU model and that the Extended Neural GPU performs on par with a GRUbased encoderdecoder model with attention. 
[link]
This paper has a simple premise: that the, say, LSTM cell works better with multiplicative updates (equation 2) rather than additive ones (equation 1). This additive update is used in various places in lieu of additive ones, in various places in the LSTM recurrence equations (the exact formulation is in the supplementary material). A slightly hand wavy argument is made in favour of the multiplicative update, on the grounds of superior gradient flow (section 2.2). Mainly however, the authors make a rather thorough empirical investigation which shows remarkably good performance of their new architectures, on a range of real problems. Figure 1(a) is nice, showing an apparent greater information flow (as defined by a particular gradient) through time for the new scheme, as well as faster convergence and less saturated hidden unit activations. Overall, the experimental results appear thorough and convincing, although I am not a specialist in this area. This model presents a multiplicative alternative (with an additive component) to the additive update which happens at the core of various RNNs (Simple RNNs, GRUs, LSTMs). The multiplicative component, without introducing a significant change in the number of parameters, yields better gradient passing properties which enable the learning of better models, as shown in experiments. 
[link]
This paper proposes several definitions of measures of complexity of a recurrent neural network. They measure 1) recurrent depth (degree of multilayeredness as a function of time of recursive connections) 2) feedforward depth (degree of multilayeredness as a function of input > output connections) 3) recurrent skip coefficient (degree of directness, like the inverse of multilayeredness, of connections) In addition to the actual definitions, there are two main contributions:  The authors show that the measures (which are limits as the number of time steps > infinity) are well defined.  The authors correlate the measures with empirical performance in various ways, showing that all measure of depth can lead to improved performance. This paper provides 3 measures of complexity for RNNs. They then show experimentally that these complexity measures are meaningful, in the sense that increasingly complexity seems to correlated with better performance. The authors first present a rigorous graphtheoretic framework that describes the connecting architectures of RNNs in general, with which the authors easily explain how we can unfold an RNN. The authors then go on and propose tree architecture complexity measures of RNNs, namely the recurrent depth, the feedforward depth and the recurrent skip coefficient. Experiments on various tasks show the importance of certain measures on certain tasks, which indicates that those three complexity measures might be good guidelines when designing a recurrent neural network for certain tasks. 
[link]
The proposed approach consists in corrupting the training targets with a noise derived from the task reward while doing maximum likelihood training. This simple but specific smoothing of the target distribution allows to significantly boost the performance of neural structured output prediction as showcased on TIMIT phone and translation tasks. The link between this approach and RLbased expected reward maximization is also made clear by the paper, Prior work has chosen either maximum likelihood learning, which is relatively tractable but assumes a log likelihood loss, or reinforcement learning, which can be performed for a taskspecific loss function but requires sampling many predictions to estimate gradients. The proposed objective bridges the gap with "rewardaugmented maximum likelihood," which is similar to maximum likelihood but estimates the expected loss with samples that are drawn in proportion to their distance from the ground truth. Empirical results show good improvements with LSTMbased predictors on speech recognition and machine translation benchmarks relative to maximum likelihood training. This work is inspired by recent advancement in reinforcement learning and likelihood learning. The authors suggest to learn parameters so as to minimize the KL divergence between CRFs and a probability model that is proportional to the reward function (which the authors call payoff distribution, see Equation 4). The authors suggest an optimization algorithm for the KLdivergence minimization that depends on sampling from the payoff distribution. Current methods to learn a model for structured prediction include max margin optimisation and reinforcement learning. However, the max margin approach only optimises a bound on the true reward, and requires loss augmented inference to obtain gradients, which can be expensive. On the other hand, reinforcement learning does not make use of available supervision, and can therefore struggle when the reward is sparse, and furthermore the gradients can have high variance. The paper proposes a novel approach to learning for problems that involve structured prediction. They relate their approach to simple maximum likelihood (ML) learning and reinforcement learning (RL): ML optimises the KL divergence of a delta distribution relative to the model distribution, and RL optimises the KL divergence of the model distribution relative to the exponentiated reward distribution. They propose rewardaugmented maximum likelihood learning, which optimises the KL divergence of the exponentiated reward distribution relative to the model distribution. Compared to RL, the arguments of the KL divergence are swapped. Compared to ML, the delta distribution is generalised to the exponentiated reward distribution. Training is cheap in RML learning. It is only necessary to sample from the output set according to the exponentiated reward distribution. All experiments are performed in speech recognition and machine translation, where the structure over the output set is defined by the edit distance. An improvement is demonstrated over simple ML. 
[link]
Swapout is a method that stochastically selects forward propagation in a neural network from a palette of choices: drop, identity, feedforward, residual. Achieves best results on CIFAR10,100 that I'm aware of. This paper examines a stochastic training method for deep architectures that is formulated in such a way that the method generalizes dropout and stochastic depth techniques. The paper studies a stochastic formulation for layer outputs which could be formulated as $Y =\Theta_1 \odot X+ \Theta_2 \odot F(X)$ where $\Theta_1$ and $\Theta_2$ are tensors of i.i.d. Bernoulli random variables. This allows layers to either: be dropped $(Y=0)$, act a feedforward layer $Y=F(X)$, be skipped $Y=X$, or behave like a residual network $Y=X+F(X)$. The paper provides some well reasoned conjectures as to why "both dropout and swapout networks interact poorly with batch normalization if one uses deterministic inference", while also providing some nice experiments on the importance of the choice of the form of stochastic training schedules and the number of samples required to obtain estimates that make sampling useful. The approach is able to yield performance improvement over comparable models if the key and critical details of the stochastic training schedule and a sufficient number of samples are used. This paper proposes a generalization of some stochastic regularization techniques for effectively training deep networks with skip connections (i.e. dropout, stochastic depth, ResNets.) Like stochastic depth, swapout allows for connections that randomly skip layers, which has been shown to give improved performanceperhaps due to shorter paths to the loss layer and the resulting implicit ensemble over architectures with differing depth. However, like dropout, swapout is independently applied to each unit in a layer allowing for a richer space of sampled architectures. Since accurate expectation approximations are not easily attainable due to the skip connections, the authors propose stochastic inference (in which multiple forward passes are averaged during inference) instead of deterministic inference. To evaluate its effectiveness, the authors evaluate swapout on the CIFAR dataset, showing improvements over various baselines. 
[link]
The paper addresses the problem of compressive sensing MRI (CSMRI) by proposing a "deep unfolding" approach (cf. http://arxiv.org/abs/1409.2574) with a sparsitybased data prior and inference via ADMM. All layers of the proposed ADMMNet are based on a generalization of ADMM inference steps and are discriminatively trained to minimize a reconstruction error. In contrast to other methods for CSMRI, the proposed approach offers both high reconstruction quality and fast runtime. The basic idea is to convert the convention optimization based CS reconstruction algorithm into a fixed neural network learned with backpropagation algorithm. Specifically, the ADMMbased CS reconstruction is approximated with a deep neural network. Experimental results show that the approximated neural network outperforms several existing CSMRI algorithms with less computational time. The ADMM algorithm has proven to be useful for solving problems with differentiable and nondifferentiable terms, and therefore has a clear link with compressed sensing. Experiments prove some gain in performance with respect to the state of the art, specially in terms of computational cost at test time. 
[link]
A study of how scan orders influence Mixing time in Gibbs sampling. This paper is interested in comparing the mixing rates of Gibbs sampling using either systematic scan or random updates. The basic contributions are two: First, in Section 2, a set of cases where 1) systematic scan is polynomially faster than random updates. Together with a previously known case where it can be slower this contradicts a conjecture that the speeds of systematic and random updates are similar. Secondly, (In Theorem 1) a set of mild conditions under which the mixing times of systematic scan and random updates are not "too" different (roughly within squares of each other). First, following from a recent paper by Roberts and Rosenthal, the authors construct several examples which do not satisfy the commonly held belief that systematic scan is never more than a constant factor slower and a log factor faster than random scan. The authors then provide a result Theorem 1 which provides weaker bounds, which however they verify at least under some conditions. In fact the Theorem compares random scan to a lazy version of the systematic scan and shows that and obtains bounds in terms of various other quantities, like the minimum probability, or the minimum holding probability. MCMC is at the heart of many applications of modern machine learning and statistics. It is thus important to understand the computational and theoretical performance under various conditions. The present paper focused on examining systematic Gibbs sampling in comparison to random scan Gibbs. They do so first though the construction of several examples which challenge the dominant intuitions about mixing times, and develop theoretical bounds which are much wider than previously conjectured. 
[link]
The paper ([arxiv](https://arxiv.org/abs/1610.09716)) introduces DCNNs (Doubly Convolutional Neural Networks). Those are CNNs which contain a new layer type which generalized convolutional layers. ## Ideas CNNs seem to learn many filters which are similar to other learned filters in the same layer. The weights are only slightly shifted. The idea of double convolution is to learn groups filters where filters within each group are translated versions of each other. To achieve this, a doubly convolutional layer allocates a set of meta filters which has filter sizes that are larger than the effective filter size. Effective filters can be then extracted from each meta filter, which corresponds to convolving the meta filters with an identity kernel. All the extracted filters are then concatenated, and convolved with the input. > We have also confirmed that replacing a convolutional layer with a doubly convolutional layer consistently improves the performance, regardless of the depth of the layer. ## Evaluation * CIFAR10+: 7.24% error * CIFAR100+: 26.53% error * ImageNet: 8.23% Top5 error ## Critique The ktranslation correlation is effectively a [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). I think the authors should have mentioned that. ## Related TODO 
[link]
TLDR; The authors encourage exploration by adding a pseudoreward of the form $\frac{\beta}{\sqrt{count(state)}}$ for infrequently visited states. State visits are counted using Locality Sensitive Hashing (LSH) based on an environmentspecific feature representation like raw pixels or autoencoder representations. The authors show that this simple technique achieves gains in various classic RL control tasks and several games in the ATARI domain. While the algorithm itself is simple there are now several more hyperaprameters to tune: The bonus coefficient `beta`, the LSH hashing granularity (how many bits to use for hashing) as well as the type of feature representation based on which the hash is computed, which itself may have more parameters. The experiments don't paint a consistent picture and different environments seem to need vastly different hyperparameter settings, which in my opinion will make this technique difficult to use in practice. 
[link]
#### Introduction * Automated Theorem Proving (ATP)  Attempting to prove mathematical theorems automatically. * Bottlenecks in ATP: * **Autoformalization**  Semantic or formal parsing of informal proofs. * **Automated Reasoning**  Reasoning about already formalised proofs. * Paper evaluates the effectiveness of neural sequence models for premise selection (related to automated reasoning) without using hand engineered features. * [Link to the paper](https://arxiv.org/abs/1606.04442) #### Premise Selection * Given a large set of premises P, an ATP system A with given resource limits, and a new conjecture C, predict those premises from P that will most likely lead to an automatically constructed proof of C by A #### Dataset * Mizar Mathematical Library (MML) used as the formal corpus. * The premise length varies from 5 to 84299 characters and over 60% if the words occur fewer than 10 times (rare word problem). #### Approach * The model predicts the probability that a given axiom is useful for proving a given conjecture. * Conjecture and axiom sequences are separately embedded into fixed length real vectors, then concatenated and passed to a third network with few fully connected layers and logistic loss. * The two embedding networks and the joint predictor path are trained jointly. ##### Stage 1: Characterlevel Models * Treat premises on character level using an 80dimensional one hot encoding vector. * Architectures for embedding: * pure recurrent LSTM and GRU Network * CNN (with max pooling) * Recurrentconvolutional network that shortens the sequence using convolutional layer before feeding it to LSTM. ##### Stage 2: Wordlevel Models * MML dataset contains both implicit and explicit definitions. * To avoid manually encoding the implicit definitions, the entire statement defining an identifier is embedded and the definition embeddings are used as word level embeddings. * This is better than recursively expanding and embedding the word definition as the definition chains can be very deep. * Once word level embeddings are obtained, the architecture from Characterlevel models can be reused. #### Experiments ##### Metrics * For each conjecture, the model ranks the possible premises. * Primary metric is the number of conjectures proved from topk premises. * Average Max Relative Rank (AMMR) is more sophisticated measure based on the motivation that conjectures are easier to prove if all their dependencies occur earlier in ranking. * Since it is very costly to rank all axioms for a conjecture, an approximation is made and a fixed number of random false dependencies are used for evaluating AMMR. ##### Network Training * Asynchronous distributed stochastic gradient descent using Adam optimizer. * Clipped vs Nonclipped Gradients. * Max Sequence length of 2048 for characterlevel sequences and 500 for wordlevel sequences. ##### Results * Best Selection Pipeline  Stage 1 characterlevel CNN which produces wordlevel embeddings for the next stage. * Best models use simple CNNs followed by max pooling and twostage definitionbased defCNN outperforms naive wordCNN (where word embeddings are learnt in a single pass). 
[link]
"Aapo did it again!"  I exclaimed while reading this paper yesterday on the train back home (or at least I thought I was going home until I realised I was sitting on the wrong train the whole time. This gave me a couple more hours to think while traveling on a variety of longdistance buses...) Aapo Hyvärinen is one of my heroes  he did tons of cool work, probably most famous for pseudolikelihood, score matching and ICA. His recent paper, brought to my attention by my brand new colleague Hugo Larochelle, is similarly excellent: ### Summary Timecontrastive learning (TCL) trains a classifier that looks at a datapoint and guesses what part of the time series it came from. it exploits nonstationarity of time series to help representation learning an elegant connection to generative models (nonlinear ICA) is shown, although the assumptions of the model are pretty limiting TCL is the temporal analogue of representation learning with jigsaw puzzles similarly to GANs, logistic regression is deployed as a proxy to learn loglikelihood ratios directly from data Timecontrastive learning Timecontrastive learning (TCL) is a technique for learning to extract nonlinear representations from time series data. First, the time series is sliced up into a number of nonoverlapping chunks, indexed by ττ. Then, a multivariate logistic regression classifier is trained in a supervised manner to look at a sample taken from the series at an unknown time and predict ττ, the index of the chunk it came from. For this classifier, a neural network is used. The classifier itself is only a proxy to solving the representation learning problem. It turns out, if you chop off the final linear + softmax layer, the activations in the last hidden layer will learn to represent something fundamental, the logoddsratios in a probabilistic generative model (see paper for details). If one runs linear ICA over these hidden layer activations, the resulting network will learn to perform inference in a nonlinear ICA latent variable model. Moreover, if certain conditions about nonstationarity and the generative model are met, one can prove that the latent variable model is identifiable. This means that if the data was indeed drawn from the nonlinear ICA generative model, the resulting inference network  composed by the chopping off the top of the classifier and replacing it with a linear ICA layer  can infer the true hidden variables exactly. ### How practical are the assumptions? TCL relies on the nonstationarity of time series data: the statistics of data changes depending on which chunk or slice of the time series you are in, but it is also assumed that data are i.i.d. within each chunk. The proof also assumes that the chunkconditional data distributions are slightly modulated versions of the same nonlinear ICA generative model, this is how the model ends up identifiable  because we can use the different temporal chunks as different perspectives on the latent variables. I would say that these assumptions are not very practical, or at least on data such as natural video. Something along the lines of slowfeature analysis, with latent variables that exhibit more interesting behaviour over time would be desirable. Nevertheless, the model is complex enough to make a point, and I beleive TCL itself can be deployed more generally for representation learning. ### Temporal jigsaw It's not hard to see that TCL is analogous to a temporal version of the jigsaw puzzle method I wrote about last month. In the jigsaw puzzle method, one breaks up a single image into nonoverlapping chunks, shuffles them, and then trains a network to reassemble the pieces. Here, the chunking happens in the temporal domain instead. There are other papers that use the same general idea: training classifiers that guess the correct temporal ordering of frames or subsequences in videos. To do well at their job, these classifiers can end up learning about objects, motion, perhaps even a notion of inertia, gravity or causality. Ishan Misra et al. (2016) Unsupervised Learning using Sequential Verification for Action Recognition Basura Fernando et al. (2015) Modeling Video Evolution For Action Recognition In this context, the key contribution of Hyvarinen and Morioka's paper is to provide extra theoretical justification, and relating the idea to generative models. I'm sure one can use this framework to extend TCL to slightly more plausible generative models. ### Key takeaway `Logistic regression learns likelihood ratios` This is yet another example of using logistic regression as a proxy to estimating logprobabilityratios directly from data. The same thing happens in generative adversarial networks, where the discriminator learns to represent $\log P(x)  \log Q(x)$, where $P$ and $Q$ are the real and synthetic data distributions, respectively. This insight provides new ways in which unsupervised or semisupervised tasks can be reduced to supervised learning problems. As classification is now considered significantly easier than density estimation, direct probability ratio estimation may provide the easiest path forward for representation learning. 