[link]
This paper proposes a framework where an agent learns to navigate a 2D maze-like environment (XWORLD) from (templated) natural language commands, in the process simultaneously learning visual representations, the syntax and semantics of language, and navigation actions. The task is essentially VQA + navigation; at every step the agent gets either a question about the environment or a navigation command, and the output is either a navigation action or an answer. Key contributions:

- Grounding and recognition are tied together as two versions of the same problem. In grounding, given an image feature map and a label (word), the problem is to find regions of the image corresponding to the word's semantics (attention map); in recognition, given an image feature map and attention, the problem is to assign a word label. Thus word embeddings (for grounding) and softmax layer weights (for recognition) are tied together. This enables transferring concepts learnt during recognition to navigation.
- Further, recognition is modulated by question intent. For example, given an attention map that highlights an agent's west, should it be recognized as 'west', 'apple' or 'red' (location, object or attribute)? It depends on what the question asks. Thus, a GRU encoding of the question produces an embedding mask that modulates recognition. The equivalent when grounding is that word embeddings are passed through fully-connected layers.
- Compositionality in language is exploited by performing grounding and recognition sequentially, (softly) attending to parts of a sentence and grounding them in the image. The resulting attention map is selectively combined with attention from previous timesteps for the final decision.

## Weaknesses / Notes

Although the environment is super simple, it's a neat framework and it is useful that the target is specified in natural language (unlike prior/concurrent work, e.g. Zhu et al., ICRA17). The model gets to see a top-down centred view of the entire environment at all times, which is a little weird.
[link]
This paper describes using Relation Networks (RN) for reasoning about relations between objects/entities. The RN is a plug-and-play module, and although it expects object representations as input, the semantics of what an object is need not be specified, so object representations can be convolutional layer feature vectors, entity embeddings from text, or something else. The feedforward network is free to discover relations between objects (as opposed to being hand-assigned specific relations).

- At its core, the RN has two parts:
  - a feedforward network `g` that operates on pairs of object representations, for all possible pairs, with all pairwise computations pooled via element-wise addition
  - a feedforward network `f` that operates on the pooled features for the downstream task, everything being trained end-to-end
- When dealing with pixels (as in the CLEVR experiment), individual object representations are spatially distinct convolutional layer features (say 196 512-d object representations for VGG conv5). The other experiment on CLEVR uses explicit factored object state representations with 3D coordinates, shape, material, color, size.
- For bAbI, object representations are LSTM encodings of supporting sentences.
- For VQA tasks, `g` conditions its processing on the question encoding as well, as the relations that are relevant for figuring out the answer are question-dependent.

## Strengths

- Very simple idea, clearly explained, performs well. Somewhat shocked that it hasn't been tried before.

## Weaknesses / Notes

Fairly simple idea — let a feedforward network operate on all pairs of object representations and figure out the relations necessary for the downstream task with end-to-end training. And it is fairly general in its design: relations aren't hand-designed and neither are object representations — for RGB images, these are spatially distinct convolutional layer features; for text, these are LSTM encodings of supporting facts; and so on. This module can be dropped in and combined with more sophisticated networks to improve performance at VQA. RNs also offer an alternative design choice to prior works on CLEVR that have an explicit notion of programs or modules with specialized roles (that need to be pre-defined), as opposed to letting these relations emerge, reducing dependency on hand-designing modules and adding in inductive biases from an architectural point of view for the network to reason about relations (earlier end-to-end VQA models didn't have the capacity to figure out relations).
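A minimal sketch of the RN computation described above, in PyTorch. The class name, layer sizes and answer-vocabulary size are illustrative assumptions, not the paper's exact configuration; the point is `g` over all object pairs (conditioned on the question), element-wise summation, then `f`.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=512, q_dim=128, hidden=256, n_answers=28):
        super().__init__()
        # g operates on (object_i, object_j, question) triples
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # f operates on the pooled pairwise features
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers))

    def forward(self, objects, question):
        # objects: (batch, n_objects, obj_dim), e.g. flattened conv features
        # question: (batch, q_dim), e.g. an LSTM encoding
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)   # object i broadcast over j
        o_j = objects.unsqueeze(1).expand(b, n, n, d)   # object j broadcast over i
        q = question.unsqueeze(1).unsqueeze(1).expand(b, n, n, question.size(-1))
        pairs = torch.cat([o_i, o_j, q], dim=-1)        # all ordered pairs, conditioned on q
        pooled = self.g(pairs).sum(dim=(1, 2))          # element-wise sum over all pairs
        return self.f(pooled)                           # answer logits
```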
[link]
This paper proposes a conditional GAN-based image captioning model. Given an image, the generator generates a caption, and given an image and caption, the discriminator/evaluator distinguishes between generated and real captions. Key ideas:

- Since caption generation involves sequential sampling, which is non-differentiable, the model is trained with policy gradients, with the action being the choice of word at every time step, the policy being the distribution over words, and the reward being the score assigned by the evaluator to the generated caption.
- The evaluator expects a completely generated caption as input (along with the image), which in practice leads to convergence issues. Thus, to provide feedback for partial sequences during training, Monte Carlo rollouts are used, i.e. given a partial generated sequence, n completions are sampled and run through the evaluator to compute the reward.
- The evaluator's objective function consists of three terms:
  - image-caption pairs from training data (positive)
  - image and generated captions (negative)
  - image and sampled captions for other images from training data (negative)
- Both the generator and evaluator are pretrained with supervision / MLE, then fine-tuned with policy gradients. During inference, the evaluator score is used as the beam search objective.

## Strengths

This is a neat paper with insightful ideas (Monte Carlo rollouts for assigning rewards to partial sequences, evaluator score as beam search objective), and is perhaps the first work on C-GAN-based image captioning.

## Weaknesses / Notes
[link]
This paper performs a comparative study of recent advances in deep learning with human-like learning from a cognitive science point of view. Since natural intelligence is still the best form of intelligence, the authors list a core set of ingredients required to build machines that reason like humans:

- Cognitive capabilities present from childhood in humans.
  - Intuitive physics; for example, a sense of plausibility of object trajectories, affordances.
  - Intuitive psychology; for example, goals and beliefs.
- Learning as rapid model-building (and not just pattern recognition).
  - Based on compositionality and learning-to-learn.
  - Humans learn by inferring a general schema to describe goals, object types and interactions. This enables learning from few examples.
  - Humans also learn richer conceptual models.
    - Indicator: the variety of functions supported by these models: classification, prediction, explanation, communication, action, imagination and composition.
    - Models should hence have strong inductive biases and domain knowledge built into them; structural sharing of concepts by compositional reuse of primitives.
- Use of both model-free and model-based learning.
  - Model-free: fast selection of actions in simple associative learning and discriminative tasks.
  - Model-based: learning when a causal model has been built, to plan future actions or maximize rewards.
- Selective attention, augmented working memory, and experience replay are low-level promising trends in deep learning inspired by cognitive psychology.
- There remains a need for the higher-level ingredients listed above.
[link]
This paper presents an approach to visual question answering by dynamically composing networks of independent neural modules based on the semantic parsing of the question. Main contributions: - Independent neural modules that can be combined together and jointly trained. - Attention: Convolutional layer, with different filters for different instances. For example, attend[dog], attend[cat], etc. - Re-attention: FC-ReLU-FC-ReLU, weights are different for different instances. For example, re-attend[above], re-attend[not], etc. - Combination: Stacks two attention maps, followed by conv-ReLU to map to a single attention map. For example, combine[and], combine[except], etc. - Classification: Combines attention map and image, followed by FC-Softmax to map to answer. For example, classify[colors]. - Measurement: FC-ReLU-FC-Softmax, takes attention map as input. For example, measure[exists]. - Structured representations are extracted from questions and these are then mapped to network layouts, including the connections between them. - All leaves become attend modules, all internal nodes become re-attend or combine modules dependent on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types. - Networks with the same structure but different instantiations can be processed in the same batch. For example, classify[color]\(attend[cat]\), classify[where]\(attend[truck]\). - Predictions from the module network are combined with LSTM representations to get the final answer. - Syntactic regularities: 'what is flying?' and 'what are flying?' get mapped to the same module network. - Semantic regularities: 'green' is an implausible answer for 'what color is the bear?'. - Experiments are performed on the synthetic SHAPES dataset and VQA dataset. - Performance on the SHAPES dataset is better as it is designed to benefit from compositionality. ## Strengths - This model takes advantage of the inherently compositional property of language, which makes a lot of sense. VQA is an extremely complex task and breaking it up into separate functions/modules is an excellent approach. ## Weaknesses / Notes - Mapping from syntactic structure to module network is hand-designed. Ideally, the model should learn this too to generalize. - Due to its compositional nature, this kind of model can possibly be used in the zero-shot learning setting, i.e. generalize to novel question types that the network hasn't seen before. |
[link]
This paper presents a way to reduce the expected network depth of deep residual networks during training by randomly dropping a subset of residual blocks and bypassing them with identity connections. The 'survival' probability $p\_l$ decreases linearly with depth (from 1.0 to 0.5 at last layer) so as to keep layers that extract low-level features with higher probability. At test time, residual block functions are scaled by the expected number of times it appears during training, i.e. $p\_l$. This model achieves lower test errors than ResNets (with ReLU activations) on CIFAR-10, CIFAR-100 and SVHN. ## Strengths - Shorter expected depth leads to faster training (>25% speedup). - Helps reduce the vanishing gradient problem as shown by the mean gradient magnitude v/s epochs plot. - Linear decay of survival probability works better than uniform survival, which supports the intuition that low-level features need to be reliably present. - Stochastic depth acts as a regularizer. The 1202-layer stochastic depth residual network shows improvements over the 110-layer network, while the original ResNets paper reports overfitting and higher test error with 1000+ layers. ## Weaknesses / Notes - Test errors for the updated ResNet architecture (ReLU activation inside residual function) are missing. That should perform better. Also, numbers on ImageNet. - Stochastic depth can be interpreted as sequential ensembling as compared to parallel ensembles. - It would be interesting to look at the filters learnt by stochastic depth residual networks, and to understand whether/how these networks learn hierarchical features as compared to the conventional CNN intuitions of compositionality. |
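A sketch of the linear survival-probability schedule and a residual block that drops its residual branch during training and rescales it by $p_l$ at test time. The inner convolutional layers are placeholders rather than the paper's exact block; this is only meant to show the mechanism.

```python
import torch
import torch.nn as nn

def survival_prob(l, L, p_last=0.5):
    # Linear decay from 1.0 at the first block to p_last at the last block.
    return 1.0 - (l / L) * (1.0 - p_last)

class StochasticDepthBlock(nn.Module):
    def __init__(self, channels, p_survive):
        super().__init__()
        self.p = p_survive
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        if self.training:
            # Keep the residual branch with probability p; the identity shortcut always stays.
            if torch.rand(1).item() < self.p:
                return torch.relu(x + self.residual(x))
            return x
        # At test time, scale the residual function by its survival probability p_l.
        return torch.relu(x + self.p * self.residual(x))
```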
[link]
This paper builds on top of a bunch of existing ideas for building neural conversational agents so as to control against generic and repetitive responses. Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:

1. Ease of answering — measured as the likelihood of responding to a query with a list of hand-picked dull responses (more negative log likelihood is higher reward).
2. Information flow — consecutive responses from the same agent (person) should have different information, measured as the negative of the log cosine distance (more negative is better).
3. Semantic coherence — mutual information between source and target (the response should make sense wrt the query): $P(a|q) + P(q|a)$ where $a$ is the answer and $q$ is the question.

The model is pre-trained with the usual supervised objective function, taking the source as the concatenation of the two previous utterances. Then there are two stages of policy gradient training, first with just the mutual information reward and then with a combination of all three. The policy network (sequence-to-sequence model) produces a probability distribution over actions (responses) given the state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from the model; the rewards for these are then averaged, and gradients are computed for the first L tokens of the response using MLE and the remaining T-L tokens with policy gradients, with L being gradually annealed to zero (moving towards just the long-term reward). Evaluation is based on length of dialogue, diversity (distinct unigrams, bigrams) and human studies on: 1) which of two outputs has better quality (single turn), 2) which of two outputs is easier to respond to, and 3) which of two conversations has better quality (multi turn).

## Strengths

- Interesting results
  - Avoids generic responses
  - 'Ease of responding' reward encourages responses to be question-like
- Adding in hand-engineered approximate reward functions based on conversational properties, and using those to fine-tune a pre-trained network with policy gradients, is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable to identify modes of the distribution and not converge on a single, high-scoring, generic response.

## Weaknesses / Notes

- Evaluating conversational agents is hard. BLEU / perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.
[link]
This paper introduces Residual Nets (ResNets), which was the winning submission (152-layer deep) at ILSVRC 2015 and MS-COCO 2015, and achieves a top-5 error rate of 3.57% (ensemble of two nets). Main contributions:

- The key idea is that deeper networks face the degradation problem, i.e. higher training and test error than shallower nets, because identity mappings are hard for stacks of non-linear layers to approximate, making deeper networks harder to optimize.
- They mitigate this problem by forcing solvers to learn residual functions, i.e. $f(x) = H(x) - x$, by adding shortcut connections. If the identity mapping is the optimal formulation, the learned weights should drive $f(x)$ to 0 (and they observe that this is a suitable preconditioning, as most residual function responses are small).
- Shortcut connections (for identity mapping) don't require additional parameters.
- Size transformations are done by zero-padding (no parameters) or projections. Projections introduce additional parameters and perform slightly better.
- A bottleneck design is used to further reduce computational complexity, i.e. 1x1 convolutional layers before and after 3x3 convolutions to reduce and then restore dimensions.
- For detection and localization tasks, they use ResNets in the Faster R-CNN setting.

## Strengths

- ResNets are significantly deeper and more accurate yet computationally cheaper than VGG.
- A single ResNet outperforms previous state-of-the-art ensembles. Their final winning submission is an ensemble of two networks.

## Weaknesses / Notes

- The idea of shortcut connections to force blocks to learn residual functions preconditioned on the identity mapping is neat, more so because it doesn't require additional parameters.
- A lot of results and design decisions merit further investigation and reasoning:
  - Why do shortcuts skip 2 or 3 layers? What happens to performance if we increase the number of layers skipped?
  - How well do shortcut connections work with Inception modules? The statistical principles underlying both these architectures seem to be orthogonal; does performance further improve?
  - 152 seems to be an arbitrary number of layers that 'worked'.
- The degradation problem seen when making networks deeper by initializing layers with identity weight matrices seems to be contradictory to the results presented in the Net2Net paper.
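A rough sketch of the bottleneck residual block idea (identity shortcut plus a 1x1-3x3-1x1 residual branch). Channel sizes are illustrative assumptions, not the paper's exact ImageNet configuration.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch=256, mid_ch=64):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch), nn.ReLU(),   # 1x1: reduce dimensions
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, in_ch, 1), nn.BatchNorm2d(in_ch))               # 1x1: restore dimensions
        self.relu = nn.ReLU()

    def forward(self, x):
        # The stacked layers learn the residual f(x) = H(x) - x; the shortcut adds x back.
        return self.relu(self.residual(x) + x)
```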
[link]
This paper presents a neat method for learning spatio-temporal representations from videos. Convolutional features from intermediate layers of a CNN are extracted, to preserve spatial resolution, and fed into a modified GRU that can (in theory) learn infinite temporal dependencies. Main contributions:

- Their variant of the GRU (called GRU-RCN) uses convolution operations instead of fully-connected units.
  - This exploits the local correlation in image frames across spatial locations.
- Features from pool2, pool3, pool4, pool5 are extracted and fed into independent GRU-RCNs. Hidden states at the last time step are now feature volumes, which are average pooled to reduce to 1x1 spatially, and fed into a linear + softmax classifier. Outputs from each of these classifiers are averaged to get the final prediction.
- Other variants they experiment with are bidirectional GRU-RCNs and stacked GRU-RCNs, i.e. GRU-RCNs with connections between them (with max-pool operations for dimensionality reduction).
  - Bidirectional GRU-RCNs perform the best.
  - Stacked GRU-RCNs perform worse than the other variants, probably because of limited data.
- They evaluate their method on action recognition and video captioning, and show significant improvements over a CNN+RNN baseline, comparing favorably with other state-of-the-art methods (like C3D).

## Strengths

- The idea is simple and elegant. Earlier methods for learning video representations typically used 3D convolutions (k x k x T filters), which suffer from finite temporal capacity, or RNNs sitting on top of last-layer CNN features, which are unable to capture finer spatial resolution. In theory, this formulation solves both.
- Changing fully-connected operations to convolutions has the additional advantage of requiring fewer parameters (n\_input x n\_output x input\_width x input\_height v/s n\_input x n\_output x k\_width x k\_height).
[link]
This paper presents a model that can dynamically split computation across coarse, low-capacity sub-networks and fine, high-capacity sub-networks. The coarse model processes the entire input data and is typically shallow while the fine model focuses on a few important regions of the input and is deeper. For images as input, this is a hard attention mechanism that can be trained with stochastic gradient descent and doesn't require a task-specific attention policy trained by reinforcement learning. Key ideas: - A deep network h can be decomposed into bottom layers f and top layers g such that $h(x) = g(f(x))$. Further, f consists of two alternate sub-networks $f\_c$ and $f\_f$. $f\_c$ is a low-capacity sub-network while $f\_f$ is a high-capacity sub-network. - g should be able to use representations from $f\_c$ and $f\_f$ dynamically. $f\_c$ processes the entire input while $f\_f$ only a few important regions of the input. - The coarse model processes the entire input and the norm of the gradient of the entropy with respect to the coarse vector at each spatial region is computed which is a measure of saliency. The use of the entropy gradient as a saliency measure encourages selecting input regions that could affect the uncertainty in the model’s predictions the most. - The top-k input regions with highest saliency values are processed by the fine model. The refined representation for input to the top layers consists of both coarse and fine vectors. During backpropagation, gradients are computed for the refined model, i.e. propagating gradients at each position into either the coarse or fine features, depending on which was used. - To make sure $f\_c$ and $f\_f$ representations are interchangeable and input to the top layers has smooth transitions, an additional objective term minimizes the squared distance between coarse and fine representations and this additional term is used only to optimize the coarse layers, not the fine layers. - Experiments on cluttered MNIST, SVHN and comparison with RAM, DRAW and study with various values of number of patches for fine processing. ## Strengths - Neat, general way to split computation based on importance of input; a hard-attention mechanism that can be trained with SGD, unlike RAM. - Entropy gradient as a measure of saliency is an interesting idea, and it doesn't need labels i.e. can be used at test time. |
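A sketch of the entropy-gradient saliency computation, assuming hypothetical stand-ins `coarse_net` and `top_net` for $f_c$ and $g$ that return a coarse feature map and class logits respectively; this is illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def topk_salient_positions(coarse_net, top_net, x, k):
    coarse = coarse_net(x)                        # (batch, channels, H, W)
    coarse.retain_grad()                          # keep gradients for this non-leaf tensor
    probs = F.softmax(top_net(coarse), dim=-1)    # class distribution from coarse features only
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).sum()
    entropy.backward()
    saliency = coarse.grad.norm(dim=1)            # (batch, H, W): gradient norm per spatial position
    flat = saliency.flatten(1)
    return flat.topk(k, dim=1).indices            # indices of the k most salient positions
```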
[link]
This is follow-up work to the ResNets paper. It studies the propagation formulations behind the connections of deep residual networks and performs ablation experiments. A residual block can be represented with the equations $y_l = h(x_l) + F(x_l, W_l); x_{l+1} = f(y_l)$, where $x_l$ is the input to the l-th unit and $x_{l+1}$ is the output of the l-th unit. In the original ResNets paper, $h(x_l) = x_l$, $f$ is ReLU, and $F$ consists of 2-3 convolutional layers (bottleneck architecture) with BN and ReLU in between. In this paper, they propose a residual block with both $h(x)$ and $f(x)$ as identity mappings, which trains faster and performs better than their earlier baseline. Main contributions:

- Identity skip connections work much better than the other multiplicative interactions they experiment with:
  - Scaling ($h(x) = \lambda x$): gradients can explode or vanish depending on whether the modulating scalar $\lambda$ is greater or less than 1.
  - Gating ($1-g(x)$ for the skip connection and $g(x)$ for the function F): for gradients to propagate freely, $g(x)$ should approach 1, but then F gets suppressed, hence suboptimal. This is similar to highway networks. $g(x)$ is a 1x1 convolutional layer.
  - Gating (shortcut-only): setting high biases pushes the initial $g(x)$ towards the identity mapping, and test error is much closer to the baseline.
  - 1x1 convolutional shortcut: these work well for shallower networks (~34 layers), but training error becomes high for deeper networks, probably because they impede gradient propagation.
- Experiments on activations:
  - BN after addition messes up information flow, and performs considerably worse.
  - ReLU before addition forces the signal to be non-negative, so the signal is monotonically increasing, while ideally a residual function should be free to take values in $(-\infty, \infty)$.
  - BN + ReLU pre-activation works best. This also prevents overfitting, due to BN's regularizing effect: input signals to all weight layers are normalized.

## Strengths

- Thorough set of experiments to show that identity shortcut connections are easiest for the network to learn. The activation of any deeper unit can be written as the sum of the activation of a shallower unit and a residual function. This also implies that gradients can be directly propagated to shallower units. This is in contrast to usual feedforward networks, where gradients are essentially a series of matrix-vector products that may vanish as networks grow deeper.
- Improved accuracies over their previous ResNets paper.

## Weaknesses / Notes

- Residual units are useful and share the same core idea that worked in LSTM units. Even though stacked non-linear layers are capable of asymptotically approximating any arbitrary function, it is clear from recent work that residual functions are much easier to approximate than the complete function. The [latest Inception paper](http://arxiv.org/abs/1602.07261) also reports that training is accelerated and performance is improved by using identity skip connections across Inception modules.
- It seems like the degradation problem, which serves as motivation for residual units, exists in the first place for non-idempotent activation functions such as sigmoid and hyperbolic tan. This merits further investigation, especially with recent work on function-preserving transformations such as [Network Morphism](http://arxiv.org/abs/1603.01670), which expands the Net2Net idea to sigmoid and tanh by using parameterized activations initialized to identity mappings.
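A sketch of a full pre-activation residual unit, where BN and ReLU come before each convolution and nothing follows the addition; layer sizes are placeholders.

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(), nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(), nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        # x_{l+1} = x_l + F(x_l): both h and f are identities, so no BN or ReLU after the addition.
        return x + self.residual(x)
```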
[link]
This paper presents a simple method to accelerate the training of larger neural networks by initializing them with parameters from a trained, smaller network. Networks are made wider or deeper while preserving the same output as the smaller network which maintains performance when training starts, leading to faster convergence. Main contributions: - Net2Deeper - Initialize layers with identity weight matrices to preserve the same output. - Only works when activation function $f$ satisfies $f(If(x)) = f(x)$ for example ReLU, but not sigmoid, tanh. - Net2Wider - Additional units in a layer are randomly sampled from existing units. Incoming weights are kept the same while outgoing weights are divided by the number of replicas of that unit so that the output at the next layer remains the same. - Experiments on ImageNet - Net2Deeper and Net2Wider models converge faster to the same accuracy as networks initialized randomly. - A deeper and wider model initialized with Net2Net from the Inception model beats the validation accuracy (and converges faster). ## Strengths - The Net2Net technique avoids the brief period of low performance that exists in methods that initialize some layers of a deeper network from a trained network and others randomly. - This idea is very useful in production systems which essentially have to be lifelong learning systems. Net2Net presents an easy way to immediately shift to a model of higher capacity and reuse trained networks. - Simple idea, clearly presented. ## Weaknesses / Notes - The random mapping algorithm for different layers was done manually for this paper. Developing a remapping inference algorithm should be the next step in making the Net2Net technique more general. - The final accuracy that Net2Net models achieve seems to depend only on the model capacity and not the initialization. I think this merits further investigation. In this paper, it might just be because of randomness in training (dropout) or noise added to the weights of the new units to approximately represent the same function (when not using dropout). |
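A sketch of the Net2Wider remapping for two consecutive fully-connected layers. The function name and use of NumPy are my own choices, but the rule follows the description above: incoming weights are copied, outgoing weights are divided by the replica count, so the function computed by the network is preserved.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, rng=np.random.default_rng(0)):
    # W1: (n_in, n_hidden), b1: (n_hidden,), W2: (n_hidden, n_out)
    n_hidden = W1.shape[1]
    assert new_width >= n_hidden
    # Each new unit copies a randomly chosen existing unit; originals map to themselves.
    mapping = np.concatenate([np.arange(n_hidden),
                              rng.integers(0, n_hidden, new_width - n_hidden)])
    counts = np.bincount(mapping, minlength=n_hidden)      # replicas per original unit
    W1_new = W1[:, mapping]                                 # incoming weights copied as-is
    b1_new = b1[mapping]
    W2_new = W2[mapping, :] / counts[mapping][:, None]      # outgoing weights divided by replica count
    return W1_new, b1_new, W2_new
```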
[link]
This paper proposes the use of pretrained convolutional neural networks that have already learned to encode semantic information as loss functions for training networks for style transfer and super-resolution. The trained networks corresponding to selected style images are capable of performing style transfer for any content image with a single forward pass (as opposed to explicit optimization over output image) achieving as high as 1000x speedup and similar qualitative results as Gatys et al. Key contributions: - Image transformation network - Convolutional neural network with residual blocks and strided & fractionally-strided convolutions for in-network downsampling and upsampling. - Output is the same size as input image, but rather than training the network with a per-pixel loss, it is trained with a feature reconstruction perceptual loss. - Loss network - VGG-16 with frozen weights - Feature reconstruction loss: Euclidean distance between feature representations - Style reconstruction loss: Frobenius norm of the difference between Gram matrices, performed over a set of layers. - Experiments - Similar objective values and qualitative results as explicit optimization over image as in Gatys et al for style transfer - For single-image super-resolution, feature reconstruction loss reconstructs fine details better and 'looks' better than a per-pixel loss, even though PSNR values indicate otherwise. Respectable results in comparison to SRCNN. ## Weaknesses / Notes - Although fast, limited by styles at test-time (as opposed to iterative optimizer that is limited by speed and not styles). Ideally, there should be a way to feed in style and content images, and do style transfer with a single forward pass. |
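A sketch of the two perceptual losses, assuming `feats_*` are dictionaries of activations from a frozen VGG-16 keyed by layer name (the feature extractor itself is omitted); normalization constants are illustrative.

```python
import torch

def feature_reconstruction_loss(feats_hat, feats_target, layer):
    f1, f2 = feats_hat[layer], feats_target[layer]
    return ((f1 - f2) ** 2).mean()                      # squared Euclidean distance, normalized

def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)          # (b, c, c) Gram matrix

def style_reconstruction_loss(feats_hat, feats_style, layers):
    # Squared Frobenius norm of Gram-matrix differences, summed over a set of layers.
    return sum(((gram(feats_hat[l]) - gram(feats_style[l])) ** 2).sum() for l in layers)
```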
[link]
This paper presents a re-parameterization of the LSTM that successfully applies batch normalization, which results in faster convergence and improved generalization on several sequential tasks. Main contributions:

- Batch normalization is applied to the input-to-hidden and hidden-to-hidden projections.
  - Separate statistics are maintained for each timestep, estimated over each minibatch during training and over the whole dataset at test time.
  - For generalization to longer sequences at test time, the population statistics of time step T\_max are used for all time steps beyond it.
- The cell state is left untouched so as not to hinder the gradient flow.
- Proper initialization of the batch normalization parameters avoids vanishing gradients.
  - They plot the norm of the gradient of the loss wrt the hidden state at different time steps for different BN variance initializations. High variance ($\gamma = 1$) causes gradients to die quickly by driving activations into the saturation region.
  - Initializing the BN variance to 0.1 works well.

## Strengths

- Simple idea, and the authors finally got it to work. Proper initialization of the BN parameters and maintaining separate estimates for each time step play a key role.

## Weaknesses / Notes

- It would be useful in practice to put down a proper formulation for using batch normalization with variable-length training sequences.
[link]
This paper introduces an interpretation of deep residual networks as implicit ensembles of exponentially many shallow networks. For residual block $i$, there are $2^{i-1}$ paths from the input to $i$, and the input to $i$ is a mixture of $2^{i-1}$ different distributions. The interpretation is backed by a number of experiments, such as removing or re-ordering residual blocks at test time and plotting the norm of the gradient v/s the number of residual blocks the gradient signal passes through. Removing $k$ residual blocks (for $k \leq 20$) from a network of depth $n$ decreases the number of paths to $2^{n-k}$, but there are still sufficiently many valid paths to not hurt classification error, whereas sequential CNNs have a single viable path which gets corrupted. A plot of the gradient at the input v/s path length shows that almost all contributions to the gradient come from paths shorter than 20 residual blocks, which are the effective paths. The paper concludes by saying that network 'multiplicity', which is the number of paths, plays a key role in the network's expressivity.

## Strengths

- Extremely insightful set of experiments. These experiments nail down the intuitions as to why residual networks work, as well as clarify the connections with stochastic depth (sampling the network multiplicity during training, i.e. ensemble by training) and highway networks (reduction in the number of available paths by gating both skip connections and paths through residual blocks).

## Weaknesses / Notes

- Connections between effective paths and model compression.
[link]
This paper introduces a modification to the ResNets architecture with multi-level shortcut connections (shortcut from input to pre-final layer as level 1, shortcut over each residual block group as level 2, etc) as opposed to single-level shortcut connections in prior work on ResNets. The authors perform experiments with multi-level shortcut connections on regular ResNets, ResNets with pre-activations and Wide ResNets. Combined with drop-path regularization via stochastic depth and exploration over optimal shortcut level number and optimal depth/width ratio to avoid vanishing gradients and overfitting, this architecture achieves state-of-the-art error rates on CIFAR-10 (3.77%), CIFAR-100 (19.73%) and SVHN (1.59%). ## Strengths - Fairly exhaustive set of experiments over - Shortcut level numbers. - Identity mapping types: 1) zero-padding shortcuts, 2) 1x1 convolutions for projections and others identity, and 3) all 1x1 convolutions. - Residual block size (2 or 3 3x3 convolutional layers). - Depths (110, 164, 182, 218) and widths for both ResNets and Pre-ResNets. |
[link]
This paper introduces an end-to-end trainable neural model capable of performing analogical reasoning in image representations followed by decoding back to image space. Specifically, given a 4-tuple A:B::C:D, the task is to apply the transformation A:B to C. The motivation is clear — humans are excellent at generalizing to hypothetical transformations about images ("what if this chair were rotated 30 degrees clockwise?"). - The objective function follows directly from vector addition: $MSE(d - g(f(b) - f(a) + f(c)))$ where $f$ and $g$ are convolutional neural networks. - In case of rotation, a purely additive transformation is not optimal because repeated application of this transformation to the same query image will never return to the original point. Instead, multiplicative interactions or MLPs are used to condition the transformation on $c$ as well. - Analogy-making is also performed on disentangled representations, which separate factors of variation to separate coordinates and are learnt from distinct images $a,b, c$ such that the objective is $MSE(c - g(s . f(a) + (1-s) . f(b)))$ where $s$ are switch variables to disentangle features. Disentangled image features allow the analogy-making model to traverse the manifold of a given factor or subset of factors. - Experiments on transforming shapes, generating 2D video game sprites and 3D car renderings. ## Strengths - Neat idea, well-presented |
[link]
This paper introduces the task of dense captioning and proposes a network architecture that processes an image and produces region descriptions in a single pass and can be trained end-to-end. Main contributions:

- Dense captioning
  - A generalization of object detection (caption consists of a single word) and image captioning (region consists of the whole image).
- Fully convolutional localization network
  - Fully differentiable, can be trained jointly with the rest of the network.
  - Consists of a region proposal network, box regression (similar to Faster R-CNN) and bilinear interpolation (similar to Spatial Transformer Networks) for sampling.
- Network details
  - Convolutional layer features are extracted for the image.
  - For each element in the feature map, k anchor boxes of different aspect ratios are selected in the input image space.
  - For each of these, the localization layer predicts offsets and confidence.
  - The region proposals are projected onto the convolutional feature map and a sampling grid is computed from the output feature map to the input (bilinear sampling).
  - The computed feature map is passed through an MLP to compute representations corresponding to each region.
  - These are passed (in a batch) as the first word to an LSTM (Show and Tell) which is trained to predict each word of the caption.

## Strengths

- Fully differentiable 'spatial attention' mechanism (bilinear interpolation) in place of RoI pooling as in the case of Faster R-CNN.
  - RoI pooling is not differentiable with respect to the input proposal coordinates.
- Fast, and impressive qualitative results.

## Weaknesses / Notes

The model is very well engineered together from different works (Faster R-CNN + Spatial Transformer Networks + Show & Tell).
[link]
This paper introduces a neural network architecture that generates realistic images sequentially. They also introduce a differentiable attention mechanism that allows the network to focus on local regions of the image during reconstruction. Main contributions: - The network architecture is similar to other variational auto-encoders, except that - The encoder and decoder are recurrent networks (LSTMs). The encoder's output is conditioned on the decoder's previous outputs, and the decoder's outputs are iteratively added to the resulting distribution from which images are generated. - The spatial attention mechanism restricts the input region observed by the encoder and available to write for the decoder. ## Strengths - The spatial soft attention mechanism is effective and fully differentiable, and can be used for other tasks. - Images generated by DRAW look very realistic. ## Weaknesses / Notes |
[link]
This paper introduces an attention mechanism (soft memory access) for the task of neural machine translation. Qualitative and quantitative results show that not only does their model achieve state-of-the-art BLEU scores, it also performs significantly better on long sentences, which were a drawback of earlier NMT works. Their motivation comes from the fact that encoding all the information from an input sentence into a single fixed-length vector and using that in the decoder was probably a bottleneck. Instead, their decoder uses an attention vector, which is a weighted sum of the input hidden states, and is learned jointly. Main contributions:

- The encoder is a bidirectional RNN, in which they take the annotation of each word to be the concatenation of the forward and backward RNN states. The idea is that the hidden state should encode information from both the previous and following words.
- The proposed attention mechanism is a weighted sum of the input hidden states, the weights for which come from an attention function (a single-layer perceptron that takes as input the previous hidden state of the decoder and the current word annotation from the encoder) and are softmax-normalized.

## Strengths

- Incorporating the attention mechanism shows large improvements on longer sentences. The attention matrix is easily interpretable as well, and visualizations in the paper show that higher weights are assigned to input words that correspond to output words irrespective of their order in the sequence (unlike an attention model that uses a mixture of Gaussians, which is monotonic).

## Weaknesses / Notes

- Their model formulation to capture long-term dependencies is far more principled than Sutskever et al.'s idea of inverting the input. They should have done a comparative study with that approach as well though.
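A sketch of the additive attention mechanism described above: a single-layer perceptron scores the previous decoder state against each encoder annotation, the scores are softmax-normalized, and the context is their weighted sum. Module and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim)
        self.U = nn.Linear(enc_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, prev_dec_state, annotations):
        # prev_dec_state: (batch, dec_dim); annotations: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W(prev_dec_state).unsqueeze(1) + self.U(annotations)))
        alphas = torch.softmax(scores.squeeze(-1), dim=1)          # attention weights over source words
        context = (alphas.unsqueeze(-1) * annotations).sum(dim=1)  # weighted sum of annotations
        return context, alphas
```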
[link]
This paper hypothesizes that a CNN trained for scene classification automatically discovers meaningful object detectors, representative of the scene categories, without any explicit object-level supervision. This claim is backed by well-designed experiments which are a natural extension of the primary insight that since scenes are composed of objects (a typical bedroom would have a bed and a lamp; an art gallery would have paintings, etc.), a CNN that performs reasonably well on scene recognition must be localizing objects in intermediate layers.

## Strengths

- Demonstrates the difference in learned representations in Places-CNN and ImageNet-CNN.
  - The top 100 images that have the largest average activation per layer are picked, and it's shown that earlier layers such as pool1 prefer similar images for both networks while deeper layers tend to be more specialized to the specific task of scene or object categorization, i.e. ~75% of the top 100 images that show high activations for fc7 belong to ImageNet for ImageNet-CNN and to Places for Places-CNN.
- Simplifies input images to identify salient regions for classification.
  - The input image is simplified by iteratively removing segments that cause the least decrease in classification score until the image is incorrectly classified. This leads to the minimal image representation (sufficient and necessary) that is needed by the network to correctly recognize scenes, and many of these contain objects that provide discriminative information for scene classification.
- Visualizes the 'empirical receptive fields' of units.
  - The top K images with the highest activations for a given unit are identified. To identify which regions of the image lead to high unit activations, the image is replicated with occluders at different regions. The occluded images are passed through the network and large changes in activation indicate important regions. This leads to feature maps and finally to empirical receptive fields after appropriate centre calibration, which are more localized and smaller than the theoretical size.
- Studies the visual concepts / semantics captured by units.
  - AMT workers are surveyed on the segments that maximally activate units. They're asked to tag the visual concept, mark negative samples and provide the level of abstraction (from simple elements and colors to objects and scenes). A plot of the distribution of semantic categories at each layer shows that deeper layers do capture higher levels of abstraction, and Places-CNN units indeed discover more objects than ImageNet-CNN units.

## Weaknesses / Notes

- Unclear as to how they obtain soft, grayed-out images from the iterative segmentation methodology in the first approach, where they generate minimal image representations needed for accurate classification. I would assume these regions to be segmentations with black backgrounds and hard boundaries. Perez et al. (2013) might have details regarding this.
[link]
This paper introduces a neural network module that can learn input-dependent spatial transformations and can be inserted into any neural network. It supports transformations like scaling, cropping, rotations, and non-rigid deformations. Main contributions:

- The spatial transformer network consists of the following:
  - Localization network that regresses to the transformation parameters given the input.
  - Grid generator that uses the transformation parameters to produce a grid to sample from the input.
  - Sampler that produces the output feature map sampled from the input at the grid points.
- Differentiable sampling mechanism
  - The sampling is written in a way such that sub-gradients can be defined with respect to the grid coordinates.
  - This enables gradients to be propagated through the grid generator and localization network, and lets the network jointly learn the spatial transformer along with the rest of the network.
- A network can have multiple STNs
  - at different points in the network, to model incremental transformations at different levels of abstraction.
  - in parallel, to learn to focus on different regions of interest. For example, on the bird classification task, they show that one STN learns to be a head detector, while the other focuses on the central part of the body.

## Strengths

- Their attention (and by extension transformation) mechanism is differentiable, as opposed to earlier works on non-differentiable attention mechanisms that used reinforcement learning (REINFORCE). It also supports a richer variety of transformations than earlier works on learning transformations, like DRAW.
- State-of-the-art classification performance on distorted MNIST, SVHN, CUB-200-2011.

## Weaknesses / Notes

This is a really nice way to generalize spatial transformations in a differentiable manner so the model can be trained end-to-end. Classification performance, and more importantly, qualitative results of the kind of transformations learnt on larger datasets (like ImageNet) should be evaluated.
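A sketch of a spatial transformer restricted to affine transformations, using PyTorch's built-in grid generation and bilinear sampling; the localization network here is a toy stand-in, initialized to output the identity transform.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        # Toy localization network regressing the 6 affine parameters.
        self.loc = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                 nn.Linear(in_ch * 64, 32), nn.ReLU(), nn.Linear(32, 6))
        # Initialize to the identity transform so training starts from "no warp".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                        # per-example affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
        return F.grid_sample(x, grid, align_corners=False)        # differentiable bilinear sampler
```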
[link]
This paper introduces a Stacked Attention Network (SAN) for visual question answering. SAN uses a multiple-layer attention mechanism that uses the semantic question representation to query the image, locate relevant visual regions, and infer the answer. Details of the SAN model:

- Image features are extracted from the last pooling layer of a deep CNN (like VGG-net).
  - Input images are first scaled to 448 x 448, so at the last pooling layer, features have dimension 14 x 14 x 512, i.e. 512-dimensional vectors at each image location with a receptive field of 32 x 32 in input pixel space.
- Question features are the last hidden state of the LSTM.
  - Words are one-hot encoded, transferred to a vector space by passing through an embedding matrix, and these word vectors are fed into the LSTM at each time step.
- Image and question features are combined into a query vector to locate relevant visual regions.
  - Both the LSTM hidden state and the 512-d image feature vector at each location are transferred to the same dimensionality (say k) by a fully connected layer, added, and passed through a non-linearity (tanh).
  - Each k-dimensional feature vector is then transformed down to a single scalar and a softmax is taken over all image regions to get the attention distribution (say $p_I$).
  - This attention distribution is used to weight the pooling-layer visual features ($\sum_i p_i v_i$), and the result is added to the LSTM vector to get a new query vector.
  - In subsequent attention layers, this updated query vector is used to repeat the same process of getting an attention distribution.
  - The final query vector is used to compute a softmax over the answers.

## Strengths

- The multi-layer attention mechanism makes sense intuitively, and the qualitative results somewhat indicate that going from the first attention layer to subsequent attention layers, the network is able to focus on fine-grained visual regions as it discovers relationships among multiple objects ('what are sitting in the basket on a bicycle').
- SAN benefits VQA; they demonstrate state-of-the-art accuracies on multiple datasets, with question-type breakdown as well.

## Weaknesses / Notes

- Right now, the attention distribution is learnt in an unsupervised manner by the network. It would be interesting to think about adding a supervisory attention signal. Another way to improve accuracies would be to use deeper LSTMs.
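A sketch of a single attention layer as described above; dimensions are illustrative, and the query and region features are assumed to have the same size here so the attended visual vector can be added directly to the query.

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    def __init__(self, img_dim=512, q_dim=512, k=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, k)
        self.q_proj = nn.Linear(q_dim, k)
        self.score = nn.Linear(k, 1)

    def forward(self, v, u):
        # v: (batch, n_regions, img_dim) region features; u: (batch, q_dim) query vector
        h = torch.tanh(self.img_proj(v) + self.q_proj(u).unsqueeze(1))
        p = torch.softmax(self.score(h).squeeze(-1), dim=1)   # attention over image regions
        v_att = (p.unsqueeze(-1) * v).sum(dim=1)              # attention-weighted visual feature
        return u + v_att                                      # refined query for the next layer
```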
[link]
This paper simplifies the convolutional network proposed by Alex Krizhevsky by replacing max-pooling with strided convolutions (under the assumption that max-pooling is required only for dimensionality reduction). They also propose a novel technique for visualizing representations learnt by intermediate layers that produces nicer visualizations in input pixel space than the DeconvNet (Zeiler et al.) and saliency map (Simonyan et al.) approaches.

## Strengths

- Their model performs at par or better than the original AlexNet formulation.
  - Max-pooling replaced by convolution with stride 2
  - Fully-connected layers replaced by 1x1 convolutions and global averaging + softmax
  - Smaller filter size (same intuition as the VGGNet paper)
- Combining the DeconvNet (Zeiler et al.) and backpropagation (Simonyan et al.) approaches at the ReLU operator (which is the only point of difference) by masking out values where at least one of the input activation or output reconstruction is negative (guided backprop) is neat and leads to nice visualizations.

## Weaknesses / Notes

- Saliency maps generated from guided backpropagation definitely look much better compared to DeconvNet visualizations and the saliency maps from Simonyan et al.'s paper. It probably works better because negative saliency values only arise from the very first convolution, since negative error signals are never propagated back through the non-linearities.
[link]
This paper models object detection as a regression problem for bounding boxes and object class probabilities with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid, and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed.

## Strengths

- Works in real time. The base model runs at 45 fps and a faster version goes up to 150 fps, and they claim that it's more than twice as fast as other works on real-time detection.
- End-to-end model; localization and classification errors can be jointly optimized.
- YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN results in a ~3% mAP gain (with little added computation, since YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% v/s 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid cell can be detected.
[link]
This paper reports on a series of experiments with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks. The model achieves very good performance across datasets, and state-of-the-art on a few. The proposed model has an input layer comprising concatenated 'word2vec' embeddings, followed by a single convolutional layer with multiple filters, max-pooling over time, fully connected layers and softmax. They also experiment with static and non-static channels, which refers to whether or not the word2vec embeddings are fine-tuned.

## Strengths

- Very simple yet powerful model formulation, which achieves really good performance across datasets.
- The different model formulations drive home the point that initializing input vectors with word2vec embeddings is better than random initializations. Fine-tuning these embeddings for the task leads to further improvements over static embeddings.

## Weaknesses / Notes

- No intuition as to why the model with both static and non-static channels gives mixed results.
- They briefly mention that they experimented with SENNA embeddings, which led to worse results, although no quantitative results are provided. It would have been interesting to have a comparative study with GloVe embeddings as well.
[link]
This paper attempts to understand the representations learnt by deep convolutional neural networks by introducing two interpretable visualization techniques. Main contributions: - Class model visualizations - These are obtained by making numerical optimizations in the input space to maximize the class score. Gradients are calculated wrt input and are used to update the input image (initialized with zero image), while weights are kept fixed to those obtained from training. - Image-specific saliency map visualizations - These are approximated by using the same gradient as before (gradient of class score wrt input). The absolute pixel-wise max across channels produces the saliency map. - Relation between DeconvNet and optimization-based visualizations - Visualizations using DeconvNet are the same as gradient-based methods except for ReLU. In regular backprop, gradients flow through ReLU to units with positive input activations, whereas in case of a DeconvNet, it is computed on positive output reconstructions. ## Strengths - The visualization techniques are simple ideas and the results are interpretable. They show that the method proposed by Erhan et al. in an unsupervised setting is useful to CNNs trained in a supervised manner as well. - The image-specific class saliency can be interpreted as those pixels which need to be changed the least to have a maximum impact on the classification score. - The relation between DeconvNet visualizations and optimization-based visualizations is insightful. ## Weaknesses / Notes - The thinking behind initializing with zero image and L2 regularization in class model visualizations was missing. |
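A sketch of the image-specific saliency map computation described above: the gradient of the class score with respect to the input, followed by a pixel-wise max of absolute values across channels. `model` is assumed to be any classifier returning unnormalized class scores.

```python
import torch

def saliency_map(model, image, target_class):
    image = image.clone().requires_grad_(True)      # (1, 3, H, W)
    score = model(image)[0, target_class]           # unnormalized class score
    score.backward()                                 # gradient of class score wrt input pixels
    return image.grad.abs().max(dim=1).values[0]    # (H, W) saliency map
```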
[link]
This paper introduces a neural network architecture that is deeper and wider, yet optimized for computational efficiency, by approximating the expected sparse structure (following from Arora et al.'s work) using readily available dense blocks. An ensemble of 7 models (all with the same architecture but different image sampling) achieved the top spot in the classification task at ILSVRC 2014. "Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs." Main contributions:

- A more generalized exploration of the NIN architecture, called the Inception module.
  - 1x1 convolutions to capture dense information clusters
  - 3x3 and 5x5 to capture more spatially spread-out clusters
  - The ratio of 3x3 and 5x5 to 1x1 convolutions increases as we go deeper, as features of higher abstraction are less spatially concentrated.
  - To avoid the blow-up of output channels caused by merging the outputs of convolutional layers and the pooling layer, they use 1x1 convolutions for dimensionality reduction. This has the added benefit of another layer of non-linearity (and thus increased discriminative capability).
- Multiple intermediate layers are tied to the objective function. Since features produced by intermediate layers of a deep network are supposed to be very discriminative, and to strengthen the gradient signal passing through them during back-propagation, they attach auxiliary classifiers to intermediate layers.
  - During training, they do a weighted sum of this loss with the total loss of the network.
  - At test time, these auxiliary networks are discarded.
  - Auxiliary classifier architecture: average pooling, 1x1 convolution (for dimensionality reduction), dropout, linear layer with softmax.

## Strengths

- Excellent results on ILSVRC 2014.

## Weaknesses / Notes

- Even though the authors try to explain some of the intuition, most of the design decisions seem arbitrary.
[link]
This paper studies the transferability of features learnt at different layers of a convolutional neural network. Typically, the initial layers of a CNN learn features that resemble Gabor filters or color blobs and are fairly general, while the later layers are more task-specific. Main contributions:

- They create two splits of the ImageNet dataset (A/B) and explore how performance varies for various network design choices such as:
  - Base: CNN trained on A or B.
  - Selffer: first n layers are copied from a base network, and the rest of the network is randomly initialized and trained on the same task.
  - Transfer: first n layers are copied from a base network, and the rest of the network is trained on a different task.
  - Each of these 'copied' layers can either be fine-tuned or kept frozen.
- Selffer networks without fine-tuning don't perform well when the split is somewhere in the middle of the network (n = 3-6). This is because neurons in these layers co-adapt to each other's activations in complex ways, which get broken up when split.
  - As we approach the final layers, there is less for the network to learn and so these layers can be trained independently.
  - Fine-tuning a selffer network gives it the chance to re-learn co-adaptations.
- Transfer networks transferred at lower n perform better than at larger n, indicating that features get more task-specific as we move to higher layers.
  - Fine-tuning transfer networks, however, results in better performance. They argue that the better generalization is due to the effect of having seen the base dataset, even after considerable fine-tuning.
- Fine-tuning works much better than using random features.
- Features are more transferable across related tasks than unrelated tasks.
  - They study transferability by taking two random data splits, and splits of man-made v/s natural data.

## Strengths

- Experiments are thorough, and the results are intuitive and insightful.

## Weaknesses / Notes

- This paper only analyzes transferability across different splits of ImageNet (as similar/dissimilar tasks). They should have reported results on transferability from one task to another (classification/detection) or from one dataset to another (ImageNet/MS-COCO).
- It would be interesting to study the role of dropout in preventing co-adaptations while transferring features.
[link]
The paper introduces two key properties of deep neural networks: - Semantic meaning of individual units. - Earlier works analyzed learnt semantics by finding images that maximally activate individual units. - Authors observe that there is no difference between individual units and random linear combinations of units. - It is the entire space of activations that contains the bulk of semantic information. - Stability of neural networks to small perturbations in input space. - Networks that generalize well are expected to be robust to small perturbations in the input, i.e. imperceptible noise in the input shouldn't change the predicted class. - Authors find that networks can be made to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. - These 'adversarial examples' generalize well to different architectures trained on different data subsets. ## Strengths - The authors propose a way to make networks more robust to small perturbations by training them with adversarial examples in an adaptive manner, i.e. keep changing the pool of adversarial examples during training. In this regard, they draw a connection with hard-negative mining, and a network trained with adversarial examples performs better than others. - Formal description of how to generate adversarial examples and mathematical analysis of a network's stability to perturbations are useful studies. ## Weaknesses / Notes - Two images that are visually indistinguishable to humans but classified differently by the network is indeed an intriguing observation. - The paper feels a little half-baked in parts, and some ideas could've been presented more clearly. |
[link]
This paper introduces the Places dataset, a scene-centric dataset at the scale of ImageNet (which is object-centric), so as to enable training of deep CNNs like AlexNet, and achieves state-of-the-art results on scene benchmarks. Main contributions: - Collects a dataset at ImageNet scale for scene recognition. - Achieves state-of-the-art on scene benchmarks: SUN397, MIT Indoor67, Scene15, SUN Attribute. - Introduces measures for comparing datasets: density and diversity. - Makes a thorough comparison b/w ImageNet and Places, from dataset statistics to classification results to visualizations of learned representations.
## Strengths
- Relative density and diversity are neat ideas for comparing datasets, and are backed by AMT experiments (see the sketch below). - Relative density: The more visually similar a nearest neighbour is to a randomly sampled image from a dataset, the denser the dataset is. - Relative diversity: The more visually similar two randomly sampled images from a dataset are, the less diverse it is. - Demonstrates via activation and mean-image visualizations that different representations are learned by CNNs trained on ImageNet and Places. - Conv1 layer visualizations can be directly seen, and are similar for ImageNet-CNN and Places-CNN. They capture low-level information like oriented edges and colors. - For higher layers, they visualize the average of the top 100 images that maximize activations per unit. As we go deeper, ImageNet-CNN units have receptive fields that look more like object blobs and Places-CNN units have RFs that look more like landscapes with spatial structures.
## Weaknesses / Notes
- No explanation as to why the model trained on ImageNet and Places combined (minus overlapping images) performs better than ImageNet-CNN or Places-CNN on some benchmarks and worse on others. |
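A rough NumPy sketch of the density/diversity intuition above. In the paper, visual similarity is judged by AMT workers and the measures are relative comparisons between two datasets; here similarity is approximated by cosine similarity between (hypothetical) image feature vectors and each function returns a per-dataset score, so this is only a proxy for the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def diversity_proxy(features, n_pairs=1000):
    """Lower mean similarity of random image pairs => more diverse dataset."""
    idx = rng.integers(0, len(features), size=(n_pairs, 2))
    sims = [cosine(features[i], features[j]) for i, j in idx]
    return 1.0 - float(np.mean(sims))

def density_proxy(features, n_queries=200):
    """Higher similarity of a random image to its nearest neighbour => denser dataset."""
    queries = rng.integers(0, len(features), size=n_queries)
    sims = []
    for i in queries:
        sims.append(max(cosine(features[i], features[j])
                        for j in range(len(features)) if j != i))
    return float(np.mean(sims))

# feats_imagenet, feats_places: hypothetical (N, D) arrays of image descriptors
# print(diversity_proxy(feats_imagenet), diversity_proxy(feats_places))
```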
[link]
This paper studies a very natural generalization of convolutional layers by replacing the single filter that slides over the input feature map with a "micro network" (multi-layer perceptron). The authors argue that good abstractions are highly non-linear functions of the input data, and that instead of generating an overcomplete number of feature maps and shrinking them down in higher layers (as is the case in traditional CNNs), it would be beneficial to generate better representations on each local patch before feeding into the next layer. Main contributions: - Replaces the convolutional filter with a multi-layer perceptron (see the sketch below). - Instead of fully connected layers, uses global average pooling.
## Strengths
- Natural generalization of convolutional layers and thorough analysis. - Global average pooling of feature layers is easier to interpret and less prone to overfitting. - Better than or on par with state-of-the-art classification results on CIFAR-10, CIFAR-100, SVHN, MNIST.
## Weaknesses / Notes
- Should have explored NIN without dropout. - Results on ImageNet missing. - The global average pooling idea, although interpretable, doesn't seem to lend itself easily to fine-tuning the network on other datasets. In fine-tuning, we usually replace and learn just the last layer. |
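Applying a small MLP to every local patch is equivalent to a normal convolution followed by 1x1 convolutions, which is how an mlpconv block is usually implemented. A minimal PyTorch sketch with illustrative channel sizes and depths (not the paper's exact configuration):

```python
import torch
import torch.nn as nn

def mlpconv(in_ch, mid_ch, out_ch, kernel_size):
    """One NIN block: a spatial convolution followed by two 1x1 convolutions,
    i.e. a small MLP applied to every local patch."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size, padding=kernel_size // 2), nn.ReLU(),
        nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(),
        nn.Conv2d(mid_ch, out_ch, 1), nn.ReLU(),
    )

# Tiny NIN-style classifier (channel sizes are illustrative):
net = nn.Sequential(
    mlpconv(3, 96, 96, 5),
    nn.MaxPool2d(2),
    mlpconv(96, 192, 10, 5),        # last block emits one feature map per class
    nn.AdaptiveAvgPool2d(1),        # global average pooling instead of FC layers
    nn.Flatten(),                   # (N, 10) class scores, fed to a softmax downstream
)
logits = net(torch.randn(2, 3, 32, 32))   # -> shape (2, 10)
```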
[link]
Neural Turing Machine (NTM) consists of a neural network controller interacting with a working memory bank in a learnable manner. This is analogous to a computer — the controller is the CPU (with hidden activations as registers) and the memory matrix is the RAM. Key ideas: - Controller (modified RNN) interacts with the external world via input and output vectors, and with memory via read and write "heads" - The "read" vector is a convex combination of the row-vectors of $M\_t$ (memory matrix at time $t$) — $r\_t = \sum\_i w\_t(i) M\_t(i)$, where $w\_t$ is a vector of weightings over the N memory locations - "Writing" is decomposed into 1) erasing and 2) adding - The write head produces an erase vector $e\_t$ and an add vector $a\_t$ along with the vector of weightings over memory locations $w\_t$ - $M\_t(i) = M\_{t-1}(i)[1 - w\_t(i) e\_t] + w\_t(i) a\_t$ (elementwise over each memory row) - The erase and add vectors control which components of memory are updated, while the weightings $w\_t$ control which locations are updated - Weight vectors are produced by an addressing mechanism (see the read/write sketch below) - Content-based addressing - Each head produces a length-M key $k\_t$ that is compared to each vector $M\_t(i)$ by cosine similarity, scaled by a key-strength (temperature-like) parameter $\beta\_t$. The weightings are normalized (softmax). - Location-based addressing - Interpolation: Each head produces an interpolation gate $g\_t$ that is used to blend between the weighting at the previous timestep and the content weighting of the current timestep: $w^{g}\_t = g\_t w^{c}\_t + (1-g\_t)w\_{t-1}$ - Shift: Circular convolution (modulo N) with a shift weighting distribution, for example a softmax over integer shift positions (say 3 locations) - Sharpening: Each head emits $\gamma\_t$ to sharpen the final weighting - Experiments on copy, repeat-copy, associative memory, N-gram emulator and priority sort
## Links
- [Attention and Augmented RNNs](http://distill.pub/2016/augmented-rnns/) - [NTM-Lasagne](https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315) |
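A minimal NumPy sketch of content-based addressing plus the read and erase/add write operations. Location-based addressing (interpolation, shifting, sharpening) is omitted, and the shapes and toy usage at the bottom are illustrative; in the NTM the key, erase and add vectors are emitted by the controller.

```python
import numpy as np

def content_addressing(M, k, beta):
    """Cosine-similarity focus over the N memory rows, sharpened by key strength beta."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    e = np.exp(beta * sims)
    return e / e.sum()                        # w_t: weighting over memory locations

def read(M, w):
    return w @ M                              # r_t = sum_i w_t(i) M_t(i)

def write(M, w, erase, add):
    """Erase then add, both gated by the location weighting w."""
    M = M * (1 - np.outer(w, erase))          # M~_t(i) = M_{t-1}(i)[1 - w_t(i) e_t]
    return M + np.outer(w, add)               # M_t(i)  = M~_t(i) + w_t(i) a_t

# Toy usage: N=8 locations with 4-dimensional rows.
M = np.random.randn(8, 4)
k, erase, add = np.random.randn(4), np.random.rand(4), np.random.randn(4)
w = content_addressing(M, k, beta=5.0)
r = read(M, w)
M = write(M, w, erase, add)
```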
[link]
This paper presents R-CNN, an approach to do object detection using CNNs pre-trained for image classification. Object proposals are extracted from the image using Selective Search, dilated by a few pixels, warped to the CNN input size and fed into the CNN to extract features (they experiment with pool5, fc6, fc7). These extracted feature vectors are scored using SVMs, one per class. Bounding box regression, where they predict parameters to move the proposal closer to the ground truth, further boosts localization. The authors use AlexNet, pre-trained on ImageNet and fine-tuned for detection. Object proposals with IoU overlap of at least 0.5 with a ground-truth box are treated as positive examples, and others as negative, and a 21-way classification (20 object categories + background) is set up to fine-tune the CNN (see the IoU sketch below). After fine-tuning, SVMs are trained per class, taking only the ground-truth boxes as positives, and IoU <= 0.3 as negatives. R-CNN achieves major performance improvements on the PASCAL VOC 2007/2010 and ILSVRC2013 detection datasets. Finally, the method is extended to semantic segmentation and achieves competitive results.
## Strengths
- The method is simple and effective. - Extensive ablation studies show why R-CNN works. - FC7 is the best feature to use (against pool5, fc6). - Fine-tuning provides a large boost in performance. - VGG performs better than AlexNet. - Bounding box regression further improves localization.
## Weaknesses / Notes
- Each region proposal is processed independently, which drives up compute time. - There are lots of different parts; the network can't be trained end-to-end. |
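A small sketch of the IoU computation and the fine-tuning labeling rule summarized above. Boxes are assumed to be `[x1, y1, x2, y2]` lists, and the helper names are illustrative rather than taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def finetune_label(proposal, gt_boxes, gt_classes, background=0):
    """Fine-tuning rule from the summary: IoU >= 0.5 with some ground-truth box
    => that object's class; otherwise background."""
    best = max(range(len(gt_boxes)), key=lambda i: iou(proposal, gt_boxes[i]), default=None)
    if best is not None and iou(proposal, gt_boxes[best]) >= 0.5:
        return gt_classes[best]
    return background

# e.g. finetune_label([10, 10, 50, 60], [[12, 8, 48, 58]], [3])  ->  3
```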
[link]
This paper presents a simple approach to predicting sequences from sequential input. They use a multi-layer LSTM-based encoder-decoder architecture and show promising results on the task of neural machine translation. Their approach beats a phrase-based statistical machine translation baseline by more than 1 BLEU point, and is close to the state-of-the-art when used to re-rank the 1000-best predictions from the SMT system. Main contributions: - The first LSTM encodes an input sequence into a single vector, which is then decoded by a second LSTM (see the sketch below). End of sequence is indicated by a special character. - 4-layer deep LSTMs. - 160k source vocabulary, 80k target vocabulary. Trained on 12M sentences. Words in the output sequence are generated by a softmax over the fixed vocabulary. - Beam search is used at test time to predict translations (beam size 2 does best).
## Strengths
- Qualitative results (PCA projections) show that the learned representations are fairly insensitive to active/passive voice, as sentences similar in meaning are clustered together. - Another interesting observation is that reversing the source sequence gives a significant boost to the translation of long sentences, most likely due to the introduction of short-term dependencies that are more easily captured by the gradients.
## Weaknesses / Notes
- The idea of reversing the source input needs better justification; otherwise it comes across as an 'ugly hack'. - To re-score the n-best list of predictions of the baseline, they average the confidences of the LSTM and the baseline model. They should have reported re-ranking accuracies using just the LSTM-model confidences. |
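A minimal PyTorch sketch of the encode-to-one-vector / decode scheme, with the source reversal folded in. It uses a single LSTM layer and tiny illustrative dimensions and vocabularies, not the paper's 4-layer setup, and omits beam search.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)     # softmax over the fixed target vocabulary

    def forward(self, src, tgt_in):
        # Reverse the source sequence, as the paper does, to shorten early dependencies.
        _, state = self.encoder(self.src_emb(src.flip(1)))
        # Decoder is initialized from the single encoding of the source sentence.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                 # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=800)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 800, (2, 5)))
```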
[link]
This paper proposes a modified convolutional network architecture with increased depth, smaller filters, data augmentation and a bunch of engineering tricks; an ensemble of such networks achieves second place in the classification task and first place in the localization task at ILSVRC2014. Main contributions: - Experiments with architectures of different depths, from 11 to 19 weight layers. - Changes in architecture - Smaller convolution filters - 1x1 convolutions: a linear transformation of the input channels followed by a non-linearity, which increases the discriminative capability of the decision function. - Varying image scales - During training, the image is rescaled to set the length of the shortest side to S and then 224x224 crops are taken. - Fixed S: S=256 and S=384 - Multi-scale: S randomly sampled from [256,512] - This can be interpreted as a kind of data augmentation by scale jittering, where a single model is trained to recognize objects over a wide range of scales. - Single-scale evaluation: At test time, Q=S for fixed S and Q=0.5(S_min + S_max) for jittered S. - Multi-scale evaluation: At test time, Q={S-32,S,S+32} for fixed S and Q={S_min, 0.5(S_min + S_max), S_max} for jittered S. Resulting class posteriors are averaged. This performs the best. - Dense v/s multi-crop evaluation - In dense evaluation, the fully connected layers are converted to convolutional layers at test time (see the sketch below), and the uncropped image is passed through the fully convolutional net to get dense class scores. Scores are averaged for the uncropped image and its flip to obtain the final fixed-width class posteriors. - This is compared against taking multiple crops of the test image and averaging scores obtained by passing each of these through the CNN. - Multi-crop evaluation works slightly better than dense evaluation, but the methods are somewhat complementary, as averaging scores from both does better than either of them individually. The authors hypothesize that this is probably because of the different boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
## Strengths
- Thoughtful design of network architectures and experiments to study the effect of depth, LRN, 1x1 convolutions, pre-initialization of weights, image scales, and dense v/s multi-crop evaluations.
## Weaknesses / Notes
- No analysis of how much time these networks take to train. - It is interesting how the authors trained the deeper models (D, E) by initializing the initial and final layer parameters with those from a shallower model (A). - It would be interesting to visualize and compare the representations learnt by three stacked 3x3 conv layers versus one 7x7 conv layer, and maybe compare their receptive fields. - They mention that performance saturates with depth while going from D to E, but there should have been a more formal characterization of why that happens (is deeper always better?). - The ensemble consists of just 2 nets, yet performs really well. |
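A hedged PyTorch sketch of the FC-to-conv conversion behind dense evaluation: a fully connected layer expecting a fixed-size feature map is recast as a convolution whose kernel covers that whole map, so larger inputs yield a spatial map of class scores. The sizes (512 channels, 7x7) are illustrative, not exactly VGG's.

```python
import torch
import torch.nn as nn

# A trained classifier head: fc expects a fixed 512 x 7 x 7 input.
fc = nn.Linear(512 * 7 * 7, 4096)

# Equivalent convolution: a 7x7 kernel over 512 channels producing 4096 maps.
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(4096, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

x_small = torch.randn(1, 512, 7, 7)      # cropped input  -> 1x1 score map
x_large = torch.randn(1, 512, 11, 11)    # uncropped, larger input -> 5x5 score map
assert torch.allclose(conv(x_small).flatten(), fc(x_small.flatten(1)).flatten(), atol=1e-4)
dense_scores = conv(x_large)             # (1, 4096, 5, 5); average spatially for class posteriors
```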
[link]
This paper introduces DeconvNet, a novel visualization technique to understand the representations learnt by intermediate layers of a deep convolutional neural network. Using DeconvNet visualizations as a diagnostic tool in different settings, the authors propose changes to the model proposed by Alex Krizhevsky, which perform slightly better and generalize well to other datasets. Key contributions: - Deconvolutional network - Feature activations are mapped back to input pixel space by setting the other activations in the layer to zero and successively unpooling, rectifying and filtering (using the same parameters). - Unpooling is approximated by using switch variables to remember the location of the highest input activation (and hence these visualizations are image-specific); see the sketch below. - Rectification involves passing the signal through a ReLU non-linearity. - Filtering involves convolving the reconstructed signal with the transpose of the convolutional layer filters. - Well-designed experiments to provide insights
## Strengths
- Observation of the evolution of features - Visualizations clearly demonstrate that lower layers converge within a few epochs and upper layers develop only after a considerable number of epochs (40-50). - Feature invariance - Visualizations show that small transformations have a dramatic effect on lower layers and a lesser impact on higher layers. The model is fairly stable to translation and scaling, not so much to rotation. - Occlusion sensitivity analysis - Parts of the image are occluded, and the class posterior and feature activations are visualized. These clearly show that activations drop when the object is occluded. - Correspondence analysis - The intuition is that CNNs implicitly learn the correspondence between different parts. - To verify this, dog images with frontal pose are taken and the same part of the face is occluded in each of them. Then the difference in feature maps between each of those and the original image is calculated, and the consistency of this difference across all image pairs is measured via Hamming distance. Lower scores compared to random occlusions show that the model does learn correspondences. - The proposed model performs better than Alex Krizhevsky's model, and generalizes well to other datasets.
## Weaknesses / Notes
- The justification / intuition for the choice of smaller filters wasn't convincing enough. - Why does removing layer 7 give a better top-1 error rate on train and val? - Rotation invariance might be something worth looking into. |
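A minimal PyTorch sketch of one unpool-rectify-filter step using switch variables, via `MaxPool2d(return_indices=True)` and `MaxUnpool2d`. It covers a single layer only, and the step of zeroing all activations except the one being visualized is omitted for brevity; shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 8, 3, padding=1)
pool = nn.MaxPool2d(2, return_indices=True)    # remember where each max came from
unpool = nn.MaxUnpool2d(2)

x = torch.randn(1, 3, 32, 32)
a = F.relu(conv(x))
p, switches = pool(a)                          # switches = argmax locations per pooling window

# Project activations back towards pixel space:
# unpool with the recorded switches, rectify, then filter with the transposed weights.
r = unpool(p, switches, output_size=a.shape)
r = F.relu(r)
recon = F.conv_transpose2d(r, conv.weight, padding=1)   # back to (1, 3, 32, 32)
```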
[link]
This paper introduces a deep convolutional neural network (CNN) architecture that achieved record-breaking performance in the 2012 ImageNet LSVRC. Notably, it brings together a bunch of neat ideas in an end-to-end, trainable model. Main contributions: - Achieves state-of-the-art performance in ILSVRC-2012. - Makes available an efficient, parallelized GPU implementation of the model. - Describes in detail the features of the model that help in improving performance and reducing training time, along with extensive ablative studies. - Uses data augmentation and dropout to prevent overfitting.
## Strengths
- Uses (and popularizes) ReLUs instead of tanh as the non-linear activation unit, which makes training six times faster. - Uses local response normalization and overlapped pooling. - Data augmentation - Extracts random crops and horizontal reflections, and performs image translations, all of which preserve the label distribution. - Alters RGB pixel values by performing PCA on the RGB values of the training set, and adding to each image multiples of the principal components scaled by their eigenvalues times a random variable drawn from a Gaussian (see the sketch below). This provides invariance to changes in the intensity and color of the illumination. - Dropout prevents overfitting. Randomly drops half of the neurons in the fully connected layers, and can be interpreted as averaging over exponentially-many dropout networks.
## Weaknesses / Notes
- Lacks theoretical insight. Design decisions are motivated solely by results. |
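A NumPy sketch of the PCA colour augmentation described above. `sigma=0.1` matches the standard deviation reported in the paper; the image shapes and function names are illustrative, and the PCA is fit on a stand-in batch rather than the full training set.

```python
import numpy as np

def fit_rgb_pca(images):
    """images: (N, H, W, 3) float array. PCA over the set of all RGB pixel values."""
    pixels = images.reshape(-1, 3)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
    return eigvals, eigvecs                       # columns of eigvecs are principal components

def pca_color_augment(image, eigvals, eigvecs, sigma=0.1, rng=None):
    """Add p_i * (lambda_i * alpha_i) to every pixel, with alpha_i ~ N(0, sigma^2)
    drawn once per image."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (eigvals * alpha)           # a single RGB offset for the whole image
    return image + shift

# imgs = np.random.rand(16, 32, 32, 3)            # stand-in for training images
# vals, vecs = fit_rgb_pca(imgs)
# aug = pca_color_augment(imgs[0], vals, vecs)
```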
[link]
- Turing, in his 1950 MIND paper, proposed an operational, behavioral alternative to the philosophical question "Can machines think?" by suggesting a simple "Turing test" in which machines play the "imitation game" and humans are tasked with discerning machine from human given their responses. He believed even partial success towards this goal, given only 5 minutes of interaction, would be hard and far off. - The Turing test hasn't yet been met (except in restricted settings like Siri, Watson), but his prediction that "one will be able to speak of machines thinking without expecting to be contradicted" has proved true — "smart" computers have become commonplace. - One of the reasons the Turing test hasn't been met yet is the kind of failures today's intelligent systems make. Their capabilities are limited in the types of questions they can handle, the domains they cover, and their ability to handle unexpected input. Failure cases where "it doesn't know that it doesn't know" make humans exclaim how stupid the system is. - There is a realization that computers and humans have separate strengths, weaknesses and roles. Also, language is inherently social and connected to communicative purpose and human cooperation. It is intentional behavior, not just stimulus-response. Language also assumes that participants have models of each other, models that influence what they say and how they say it. Retrospectively speaking, Turing's imitation game misses these aspects. "Jeopardy" was clever in avoiding dialogue context and the modeling of other people's behavior. - Another big change: instead of input-output interactions between humans and computers, today humans + computers exist in "mixed" networks. - Desirable properties in today's Turing test: interactive nature + use of language in real settings (rather than success in a game) + human-machine collaboration. - Proposed Turing test: "Is it imaginable that a computer (agent) team member could behave, over the long term and in uncertain, dynamic environments, in such a way that people on the team will not notice it is not human." - This doesn't ask the machine to appear human, act human or be mistaken for one, but its non-humanness shouldn't hit people in the face. Its behavior shouldn't baffle teammates, leaving them wondering not about what it is thinking but whether it is. Such a system will also need a model of its teammates' knowledge, abilities, preferences, etc. |
[link]
Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md). This paper introduces a novel representation of actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect). - Model - The model utilizes a Siamese architecture, with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect), which are aggregated by average pooling and followed by a fully-connected layer. - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to lie in [1/3t, 1/2t] and [1/2t, 2/3t] respectively, and estimated via brute-force search during training. - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices. - The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the correctly transformed precondition embedding, and 2) push the similarity under incorrect transformations below a chosen margin (see the loss sketch below). - ACT Dataset - 50 keywords, 43 classes, ~500 YouTube videos per keyword. - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"? - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories, which are the same action under different subjects, objects and scenes. - Experiments - Action recognition on UCF101, HMDB51, ACT. - Cross-category generalization on ACT. - Visualizations - Nearest neighbor: modeling actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color. - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than on context. - Embedding retrievals based on transformed precondition embeddings.
## Thoughts
- Modeling an action as a transformation from precondition to effect is a very neat idea. - The exact formulation and supporting experiments and ablation studies are thorough. - During inference, the model first extracts features for all frames and then does a brute-force search over (y, z\_p, z\_e) to estimate the action category and the segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism over z might be feasible and would reduce computation to a single forward pass. |
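A hedged PyTorch sketch of the per-action transformation matrices with a margin-based loss. The embedding dimension, margin value and exact hinge form are illustrative choices to convey the idea, not the paper's precise objective; in the paper the embeddings come from the two Siamese heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions, dim = 43, 128                        # ACT has 43 classes; dim is illustrative
T = nn.Parameter(torch.randn(n_actions, dim, dim) * 0.01)   # one transformation per action

def transformation_loss(pre_emb, eff_emb, action, margin=0.5):
    """pre_emb, eff_emb: (dim,) precondition / effect embeddings.
    Pull the correctly transformed precondition towards the effect; push the other
    transformations away whenever their similarity exceeds the margin."""
    transformed = T @ pre_emb                   # (n_actions, dim): every action's prediction
    sims = F.cosine_similarity(transformed, eff_emb.unsqueeze(0), dim=1)   # (n_actions,)
    pos = 1.0 - sims[action]                    # maximize similarity for the true action
    mask = torch.ones(n_actions, dtype=torch.bool)
    mask[action] = False
    neg = F.relu(sims[mask] - margin).sum()     # hinge on the incorrect transformations
    return pos + neg

loss = transformation_loss(torch.randn(dim), torch.randn(dim), action=7)
```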