[link]
This paper proposes a framework where an agent learns to navigate a 2D maze-like environment (XWORLD) from (templated) natural language commands, in the process simultaneously learning visual representations, the syntax and semantics of language, and navigation actions. The task is essentially VQA + navigation; at every step the agent gets either a question about the environment or a navigation command, and the output is either a navigation action or an answer. Key contributions:

- Grounding and recognition are tied together as two versions of the same problem. In grounding, given an image feature map and a label (word), the problem is to find regions of the image corresponding to the word's semantics (attention map); in recognition, given an image feature map and attention, the problem is to assign a word label. Thus word embeddings (for grounding) and softmax layer weights (for recognition) are tied together. This enables transferring concepts learnt during recognition to navigation.
- Further, recognition is modulated by question intent. For example, given an attention map that highlights an agent's west, should it be recognized as 'west', 'apple' or 'red' (location, object or attribute)? It depends on what the question asks. Thus, a GRU encoding of the question produces an embedding mask that modulates recognition. The equivalent when grounding is that word embeddings are passed through fully-connected layers.
- Compositionality in language is exploited by performing grounding and recognition sequentially, (softly) attending to parts of a sentence and grounding them in the image. The resulting attention map is selectively combined with attention from previous timesteps for the final decision.

## Weaknesses / Notes

Although the environment is super simple, it's a neat framework and it is useful that the target is specified in natural language (unlike prior/concurrent work, e.g. Zhu et al., ICRA17). The model gets to see a top-down centred view of the entire environment at all times, which is a little weird.
[link]
This paper describes using Relation Networks (RN) for reasoning about relations between objects/entities. The RN is a plug-and-play module, and although it expects object representations as input, the semantics of what an object is need not be specified, so object representations can be convolutional layer feature vectors, entity embeddings from text, or something else. The feedforward network is free to discover relations between objects (as opposed to being hand-assigned specific relations).

- At its core, the RN has two parts:
  - a feedforward network `g` that operates on pairs of object representations, for all possible pairs, with all pairwise computations pooled via element-wise addition
  - a feedforward network `f` that operates on the pooled features for the downstream task, everything being trained end-to-end
- When dealing with pixels (as in the CLEVR experiment), individual object representations are spatially distinct convolutional layer features (say 196 512-d object representations for VGG conv5). The other experiment on CLEVR uses explicit factored object state representations with 3D coordinates, shape, material, color, size.
- For bAbI, object representations are LSTM encodings of supporting sentences.
- For VQA tasks, `g` conditions its processing on the question encoding as well, as the relations that are relevant for figuring out the answer are question-dependent.

## Strengths

- Very simple idea, clearly explained, performs well. Somewhat shocked that it hasn't been tried before.

## Weaknesses / Notes

Fairly simple idea — let a feedforward network operate on all pairs of object representations and figure out the relations necessary for the downstream task with end-to-end training. And it is fairly general in its design: relations aren't hand-designed and neither are object representations — for RGB images, these are spatially distinct convolutional layer features; for text, these are LSTM encodings of supporting facts; and so on. This module can be dropped in and combined with more sophisticated networks to improve performance at VQA. RNs also offer an alternative design choice to prior works on CLEVR that have an explicit notion of programs or modules with specialized roles (that need to be pre-defined), as opposed to letting these relations emerge, reducing dependency on hand-designing modules and adding in inductive biases from an architectural point of view for the network to reason about relations (earlier end-to-end VQA models didn't have the capacity to figure out relations).
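A minimal sketch of the RN computation described above, in PyTorch. The class name, layer sizes and answer-vocabulary size are illustrative assumptions, not the paper's exact configuration; the point is `g` over all object pairs (conditioned on the question), element-wise summation, then `f`.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=512, q_dim=128, hidden=256, n_answers=28):
        super().__init__()
        # g operates on (object_i, object_j, question) triples
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # f operates on the pooled pairwise features
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers))

    def forward(self, objects, question):
        # objects: (batch, n_objects, obj_dim), e.g. flattened conv features
        # question: (batch, q_dim), e.g. an LSTM encoding
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)   # object i broadcast over j
        o_j = objects.unsqueeze(1).expand(b, n, n, d)   # object j broadcast over i
        q = question.unsqueeze(1).unsqueeze(1).expand(b, n, n, question.size(-1))
        pairs = torch.cat([o_i, o_j, q], dim=-1)        # all ordered pairs, conditioned on q
        pooled = self.g(pairs).sum(dim=(1, 2))          # element-wise sum over all pairs
        return self.f(pooled)                           # answer logits
```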
[link]
This paper proposes a conditional GAN-based image captioning model. Given an image, the generator generates a caption, and given an image and caption, the discriminator/evaluator distinguishes between generated and real captions. Key ideas:

- Since caption generation involves sequential sampling, which is non-differentiable, the model is trained with policy gradients, with the action being the choice of word at every time step, the policy being the distribution over words, and the reward being the score assigned by the evaluator to the generated caption.
- The evaluator expects a completely generated caption as input (along with the image), which in practice leads to convergence issues. Thus, to provide feedback for partial sequences during training, Monte Carlo rollouts are used, i.e. given a partial generated sequence, n completions are sampled and run through the evaluator to compute the reward.
- The evaluator's objective function consists of three terms:
  - image-caption pairs from training data (positive)
  - image and generated captions (negative)
  - image and sampled captions for other images from training data (negative)
- Both the generator and evaluator are pretrained with supervision / MLE, then fine-tuned with policy gradients. During inference, the evaluator score is used as the beam search objective.

## Strengths

This is a neat paper with insightful ideas (Monte Carlo rollouts for assigning rewards to partial sequences, evaluator score as beam search objective), and is perhaps the first work on C-GAN-based image captioning.

## Weaknesses / Notes
[link]
This paper performs a comparative study of recent advances in deep learning with human-like learning from a cognitive science point of view. Since natural intelligence is still the best form of intelligence, the authors list a core set of ingredients required to build machines that reason like humans:

- Cognitive capabilities present from childhood in humans.
  - Intuitive physics; for example, a sense of plausibility of object trajectories, affordances.
  - Intuitive psychology; for example, goals and beliefs.
- Learning as rapid model-building (and not just pattern recognition).
  - Based on compositionality and learning-to-learn.
  - Humans learn by inferring a general schema to describe goals, object types and interactions. This enables learning from few examples.
  - Humans also learn richer conceptual models.
    - Indicator: the variety of functions supported by these models: classification, prediction, explanation, communication, action, imagination and composition.
    - Models should hence have strong inductive biases and domain knowledge built into them; structural sharing of concepts by compositional reuse of primitives.
- Use of both model-free and model-based learning.
  - Model-free: fast selection of actions in simple associative learning and discriminative tasks.
  - Model-based: learning when a causal model has been built, to plan future actions or maximize rewards.
- Selective attention, augmented working memory, and experience replay are low-level promising trends in deep learning inspired by cognitive psychology.
- There remains a need for the higher-level ingredients listed above.
[link]
This paper presents an approach to visual question answering by dynamically composing networks of independent neural modules based on the semantic parsing of the question. Main contributions: - Independent neural modules that can be combined together and jointly trained. - Attention: Convolutional layer, with different filters for different instances. For example, attend[dog], attend[cat], etc. - Re-attention: FC-ReLU-FC-ReLU, weights are different for different instances. For example, re-attend[above], re-attend[not], etc. - Combination: Stacks two attention maps, followed by conv-ReLU to map to a single attention map. For example, combine[and], combine[except], etc. - Classification: Combines attention map and image, followed by FC-Softmax to map to answer. For example, classify[colors]. - Measurement: FC-ReLU-FC-Softmax, takes attention map as input. For example, measure[exists]. - Structured representations are extracted from questions and these are then mapped to network layouts, including the connections between them. - All leaves become attend modules, all internal nodes become re-attend or combine modules dependent on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types. - Networks with the same structure but different instantiations can be processed in the same batch. For example, classify[color]\(attend[cat]\), classify[where]\(attend[truck]\). - Predictions from the module network are combined with LSTM representations to get the final answer. - Syntactic regularities: 'what is flying?' and 'what are flying?' get mapped to the same module network. - Semantic regularities: 'green' is an implausible answer for 'what color is the bear?'. - Experiments are performed on the synthetic SHAPES dataset and VQA dataset. - Performance on the SHAPES dataset is better as it is designed to benefit from compositionality. ## Strengths - This model takes advantage of the inherently compositional property of language, which makes a lot of sense. VQA is an extremely complex task and breaking it up into separate functions/modules is an excellent approach. ## Weaknesses / Notes - Mapping from syntactic structure to module network is hand-designed. Ideally, the model should learn this too to generalize. - Due to its compositional nature, this kind of model can possibly be used in the zero-shot learning setting, i.e. generalize to novel question types that the network hasn't seen before. |
[link]
This paper presents a way to reduce the expected network depth of deep residual networks during training by randomly dropping a subset of residual blocks and bypassing them with identity connections. The 'survival' probability $p\_l$ decreases linearly with depth (from 1.0 to 0.5 at last layer) so as to keep layers that extract low-level features with higher probability. At test time, residual block functions are scaled by the expected number of times it appears during training, i.e. $p\_l$. This model achieves lower test errors than ResNets (with ReLU activations) on CIFAR-10, CIFAR-100 and SVHN. ## Strengths - Shorter expected depth leads to faster training (>25% speedup). - Helps reduce the vanishing gradient problem as shown by the mean gradient magnitude v/s epochs plot. - Linear decay of survival probability works better than uniform survival, which supports the intuition that low-level features need to be reliably present. - Stochastic depth acts as a regularizer. The 1202-layer stochastic depth residual network shows improvements over the 110-layer network, while the original ResNets paper reports overfitting and higher test error with 1000+ layers. ## Weaknesses / Notes - Test errors for the updated ResNet architecture (ReLU activation inside residual function) are missing. That should perform better. Also, numbers on ImageNet. - Stochastic depth can be interpreted as sequential ensembling as compared to parallel ensembles. - It would be interesting to look at the filters learnt by stochastic depth residual networks, and to understand whether/how these networks learn hierarchical features as compared to the conventional CNN intuitions of compositionality. |
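A sketch of the linear survival-probability schedule and a residual block that drops its residual branch during training and rescales it by $p_l$ at test time. The inner convolutional layers are placeholders rather than the paper's exact block; this is only meant to show the mechanism.

```python
import torch
import torch.nn as nn

def survival_prob(l, L, p_last=0.5):
    # Linear decay from 1.0 at the first block to p_last at the last block.
    return 1.0 - (l / L) * (1.0 - p_last)

class StochasticDepthBlock(nn.Module):
    def __init__(self, channels, p_survive):
        super().__init__()
        self.p = p_survive
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        if self.training:
            # Keep the residual branch with probability p; the identity shortcut always stays.
            if torch.rand(1).item() < self.p:
                return torch.relu(x + self.residual(x))
            return x
        # At test time, scale the residual function by its survival probability p_l.
        return torch.relu(x + self.p * self.residual(x))
```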
[link]
This paper builds on top of a bunch of existing ideas for building neural conversational agents so as to control against generic and repetitive responses. Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:

1. Ease of answering — measured as the likelihood of responding to a query with a list of hand-picked dull responses (more negative log likelihood is higher reward).
2. Information flow — consecutive responses from the same agent (person) should have different information, measured as the negative of the log cosine distance (more negative is better).
3. Semantic coherence — mutual information between source and target (the response should make sense wrt the query): $P(a|q) + P(q|a)$ where $a$ is the answer and $q$ is the question.

The model is pre-trained with the usual supervised objective function, taking the source as the concatenation of the two previous utterances. Then there are two stages of policy gradient training, first with just the mutual information reward and then with a combination of all three. The policy network (sequence-to-sequence model) produces a probability distribution over actions (responses) given the state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from the model; the rewards for these are then averaged, and gradients are computed for the first L tokens of the response using MLE and the remaining T-L tokens with policy gradients, with L being gradually annealed to zero (moving towards just the long-term reward). Evaluation is based on length of dialogue, diversity (distinct unigrams, bigrams) and human studies on: 1) which of two outputs has better quality (single turn), 2) which of two outputs is easier to respond to, and 3) which of two conversations has better quality (multi turn).

## Strengths

- Interesting results
  - Avoids generic responses
  - 'Ease of responding' reward encourages responses to be question-like
- Adding in hand-engineered approximate reward functions based on conversational properties, and using those to fine-tune a pre-trained network with policy gradients, is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable to identify modes of the distribution and not converge on a single, high-scoring, generic response.

## Weaknesses / Notes

- Evaluating conversational agents is hard. BLEU / perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.
[link]
This paper introduces Residual Nets (ResNets), which was the winning submission (152-layer deep) at ILSVRC 2015 and MS-COCO 2015, and achieves a top-5 error rate of 3.57% (ensemble of two nets). Main contributions:

- The key idea is that deeper networks face the degradation problem, i.e. higher training and test error than shallower nets, because identity mappings are hard for stacks of non-linear layers to approximate, making deeper networks harder to optimize.
- They mitigate this problem by forcing solvers to learn residual functions, i.e. $f(x) = H(x) - x$, by adding shortcut connections. If the identity mapping is the optimal formulation, the learned weights should drive $f(x)$ to 0 (and they observe that this is a suitable preconditioning, as most residual function responses are small).
- Shortcut connections (for identity mapping) don't require additional parameters.
- Size transformations are done by zero-padding (no parameters) or projections. Projections introduce additional parameters and perform slightly better.
- A bottleneck design is used to further reduce computational complexity, i.e. 1x1 convolutional layers before and after 3x3 convolutions to reduce and then restore dimensions.
- For detection and localization tasks, they use ResNets in the Faster R-CNN setting.

## Strengths

- ResNets are significantly deeper and more accurate yet computationally cheaper than VGG.
- A single ResNet outperforms previous state-of-the-art ensembles. Their final winning submission is an ensemble of two networks.

## Weaknesses / Notes

- The idea of shortcut connections to force blocks to learn residual functions preconditioned on the identity mapping is neat, more so because it doesn't require additional parameters.
- A lot of results and design decisions merit further investigation and reasoning:
  - Why do shortcuts skip 2 or 3 layers? What happens to performance if we increase the number of layers skipped?
  - How well do shortcut connections work with Inception modules? The statistical principles underlying both these architectures seem to be orthogonal; does performance further improve?
  - 152 seems to be an arbitrary number of layers that 'worked'.
- The degradation problem seen when making networks deeper by initializing layers with identity weight matrices seems to be contradictory to the results presented in the Net2Net paper.
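A rough sketch of the bottleneck residual block idea (identity shortcut plus a 1x1-3x3-1x1 residual branch). Channel sizes are illustrative assumptions, not the paper's exact ImageNet configuration.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch=256, mid_ch=64):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1), nn.BatchNorm2d(mid_ch), nn.ReLU(),   # 1x1: reduce dimensions
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, in_ch, 1), nn.BatchNorm2d(in_ch))               # 1x1: restore dimensions
        self.relu = nn.ReLU()

    def forward(self, x):
        # The stacked layers learn the residual f(x) = H(x) - x; the shortcut adds x back.
        return self.relu(self.residual(x) + x)
```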
[link]
This paper presents a neat method for learning spatio-temporal representations from videos. Convolutional features from intermediate layers of a CNN are extracted, to preserve spatial resolution, and fed into a modified GRU that can (in theory) learn infinite temporal dependencies. Main contributions:

- Their variant of the GRU (called GRU-RCN) uses convolution operations instead of fully-connected units.
  - This exploits the local correlation in image frames across spatial locations.
- Features from pool2, pool3, pool4, pool5 are extracted and fed into independent GRU-RCNs. Hidden states at the last time step are now feature volumes, which are average pooled to reduce to 1x1 spatially, and fed into a linear + softmax classifier. Outputs from each of these classifiers are averaged to get the final prediction.
- Other variants they experiment with are bidirectional GRU-RCNs and stacked GRU-RCNs, i.e. GRU-RCNs with connections between them (with max-pool operations for dimensionality reduction).
  - Bidirectional GRU-RCNs perform the best.
  - Stacked GRU-RCNs perform worse than the other variants, probably because of limited data.
- They evaluate their method on action recognition and video captioning, and show significant improvements over a CNN+RNN baseline, comparing favorably with other state-of-the-art methods (like C3D).

## Strengths

- The idea is simple and elegant. Earlier methods for learning video representations typically used 3D convolutions (k x k x T filters), which suffer from finite temporal capacity, or RNNs sitting on top of last-layer CNN features, which are unable to capture finer spatial resolution. In theory, this formulation solves both.
- Changing fully-connected operations to convolutions has the additional advantage of requiring fewer parameters (n\_input x n\_output x input\_width x input\_height v/s n\_input x n\_output x k\_width x k\_height).
[link]
This paper presents a model that can dynamically split computation across coarse, low-capacity sub-networks and fine, high-capacity sub-networks. The coarse model processes the entire input data and is typically shallow while the fine model focuses on a few important regions of the input and is deeper. For images as input, this is a hard attention mechanism that can be trained with stochastic gradient descent and doesn't require a task-specific attention policy trained by reinforcement learning. Key ideas: - A deep network h can be decomposed into bottom layers f and top layers g such that $h(x) = g(f(x))$. Further, f consists of two alternate sub-networks $f\_c$ and $f\_f$. $f\_c$ is a low-capacity sub-network while $f\_f$ is a high-capacity sub-network. - g should be able to use representations from $f\_c$ and $f\_f$ dynamically. $f\_c$ processes the entire input while $f\_f$ only a few important regions of the input. - The coarse model processes the entire input and the norm of the gradient of the entropy with respect to the coarse vector at each spatial region is computed which is a measure of saliency. The use of the entropy gradient as a saliency measure encourages selecting input regions that could affect the uncertainty in the model’s predictions the most. - The top-k input regions with highest saliency values are processed by the fine model. The refined representation for input to the top layers consists of both coarse and fine vectors. During backpropagation, gradients are computed for the refined model, i.e. propagating gradients at each position into either the coarse or fine features, depending on which was used. - To make sure $f\_c$ and $f\_f$ representations are interchangeable and input to the top layers has smooth transitions, an additional objective term minimizes the squared distance between coarse and fine representations and this additional term is used only to optimize the coarse layers, not the fine layers. - Experiments on cluttered MNIST, SVHN and comparison with RAM, DRAW and study with various values of number of patches for fine processing. ## Strengths - Neat, general way to split computation based on importance of input; a hard-attention mechanism that can be trained with SGD, unlike RAM. - Entropy gradient as a measure of saliency is an interesting idea, and it doesn't need labels i.e. can be used at test time. |
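A sketch of the entropy-gradient saliency computation, assuming hypothetical stand-ins `coarse_net` and `top_net` for $f_c$ and $g$ that return a coarse feature map and class logits respectively; this is illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def topk_salient_positions(coarse_net, top_net, x, k):
    coarse = coarse_net(x)                        # (batch, channels, H, W)
    coarse.retain_grad()                          # keep gradients for this non-leaf tensor
    probs = F.softmax(top_net(coarse), dim=-1)    # class distribution from coarse features only
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).sum()
    entropy.backward()
    saliency = coarse.grad.norm(dim=1)            # (batch, H, W): gradient norm per spatial position
    flat = saliency.flatten(1)
    return flat.topk(k, dim=1).indices            # indices of the k most salient positions
```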
[link]
This is follow-up work to the ResNets paper. It studies the propagation formulations behind the connections of deep residual networks and performs ablation experiments. A residual block can be represented with the equations $y_l = h(x_l) + F(x_l, W_l); x_{l+1} = f(y_l)$, where $x_l$ is the input to the l-th unit and $x_{l+1}$ is the output of the l-th unit. In the original ResNets paper, $h(x_l) = x_l$, $f$ is ReLU, and $F$ consists of 2-3 convolutional layers (bottleneck architecture) with BN and ReLU in between. In this paper, they propose a residual block with both $h(x)$ and $f(x)$ as identity mappings, which trains faster and performs better than their earlier baseline. Main contributions:

- Identity skip connections work much better than the other multiplicative interactions they experiment with:
  - Scaling ($h(x) = \lambda x$): gradients can explode or vanish depending on whether the modulating scalar $\lambda$ is greater or less than 1.
  - Gating ($1-g(x)$ for the skip connection and $g(x)$ for the function F): for gradients to propagate freely, $g(x)$ should approach 1, but then F gets suppressed, hence suboptimal. This is similar to highway networks. $g(x)$ is a 1x1 convolutional layer.
  - Gating (shortcut-only): setting high biases pushes the initial $g(x)$ towards the identity mapping, and test error is much closer to the baseline.
  - 1x1 convolutional shortcut: these work well for shallower networks (~34 layers), but training error becomes high for deeper networks, probably because they impede gradient propagation.
- Experiments on activations:
  - BN after addition messes up information flow, and performs considerably worse.
  - ReLU before addition forces the signal to be non-negative, so the signal is monotonically increasing, while ideally a residual function should be free to take values in $(-\infty, \infty)$.
  - BN + ReLU pre-activation works best. This also prevents overfitting, due to BN's regularizing effect: input signals to all weight layers are normalized.

## Strengths

- Thorough set of experiments to show that identity shortcut connections are easiest for the network to learn. The activation of any deeper unit can be written as the sum of the activation of a shallower unit and a residual function. This also implies that gradients can be directly propagated to shallower units. This is in contrast to usual feedforward networks, where gradients are essentially a series of matrix-vector products that may vanish as networks grow deeper.
- Improved accuracies over their previous ResNets paper.

## Weaknesses / Notes

- Residual units are useful and share the same core idea that worked in LSTM units. Even though stacked non-linear layers are capable of asymptotically approximating any arbitrary function, it is clear from recent work that residual functions are much easier to approximate than the complete function. The [latest Inception paper](http://arxiv.org/abs/1602.07261) also reports that training is accelerated and performance is improved by using identity skip connections across Inception modules.
- It seems like the degradation problem, which serves as motivation for residual units, exists in the first place for non-idempotent activation functions such as sigmoid and hyperbolic tan. This merits further investigation, especially with recent work on function-preserving transformations such as [Network Morphism](http://arxiv.org/abs/1603.01670), which expands the Net2Net idea to sigmoid and tanh by using parameterized activations initialized to identity mappings.
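A sketch of a full pre-activation residual unit, where BN and ReLU come before each convolution and nothing follows the addition; layer sizes are placeholders.

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(), nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(), nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        # x_{l+1} = x_l + F(x_l): both h and f are identities, so no BN or ReLU after the addition.
        return x + self.residual(x)
```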
[link]
This paper presents a simple method to accelerate the training of larger neural networks by initializing them with parameters from a trained, smaller network. Networks are made wider or deeper while preserving the same output as the smaller network which maintains performance when training starts, leading to faster convergence. Main contributions: - Net2Deeper - Initialize layers with identity weight matrices to preserve the same output. - Only works when activation function $f$ satisfies $f(If(x)) = f(x)$ for example ReLU, but not sigmoid, tanh. - Net2Wider - Additional units in a layer are randomly sampled from existing units. Incoming weights are kept the same while outgoing weights are divided by the number of replicas of that unit so that the output at the next layer remains the same. - Experiments on ImageNet - Net2Deeper and Net2Wider models converge faster to the same accuracy as networks initialized randomly. - A deeper and wider model initialized with Net2Net from the Inception model beats the validation accuracy (and converges faster). ## Strengths - The Net2Net technique avoids the brief period of low performance that exists in methods that initialize some layers of a deeper network from a trained network and others randomly. - This idea is very useful in production systems which essentially have to be lifelong learning systems. Net2Net presents an easy way to immediately shift to a model of higher capacity and reuse trained networks. - Simple idea, clearly presented. ## Weaknesses / Notes - The random mapping algorithm for different layers was done manually for this paper. Developing a remapping inference algorithm should be the next step in making the Net2Net technique more general. - The final accuracy that Net2Net models achieve seems to depend only on the model capacity and not the initialization. I think this merits further investigation. In this paper, it might just be because of randomness in training (dropout) or noise added to the weights of the new units to approximately represent the same function (when not using dropout). |
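A sketch of the Net2Wider remapping for two consecutive fully-connected layers. The function name and use of NumPy are my own choices, but the rule follows the description above: incoming weights are copied, outgoing weights are divided by the replica count, so the function computed by the network is preserved.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width, rng=np.random.default_rng(0)):
    # W1: (n_in, n_hidden), b1: (n_hidden,), W2: (n_hidden, n_out)
    n_hidden = W1.shape[1]
    assert new_width >= n_hidden
    # Each new unit copies a randomly chosen existing unit; originals map to themselves.
    mapping = np.concatenate([np.arange(n_hidden),
                              rng.integers(0, n_hidden, new_width - n_hidden)])
    counts = np.bincount(mapping, minlength=n_hidden)      # replicas per original unit
    W1_new = W1[:, mapping]                                 # incoming weights copied as-is
    b1_new = b1[mapping]
    W2_new = W2[mapping, :] / counts[mapping][:, None]      # outgoing weights divided by replica count
    return W1_new, b1_new, W2_new
```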
[link]
This paper proposes the use of pretrained convolutional neural networks that have already learned to encode semantic information as loss functions for training networks for style transfer and super-resolution. The trained networks corresponding to selected style images are capable of performing style transfer for any content image with a single forward pass (as opposed to explicit optimization over output image) achieving as high as 1000x speedup and similar qualitative results as Gatys et al. Key contributions: - Image transformation network - Convolutional neural network with residual blocks and strided & fractionally-strided convolutions for in-network downsampling and upsampling. - Output is the same size as input image, but rather than training the network with a per-pixel loss, it is trained with a feature reconstruction perceptual loss. - Loss network - VGG-16 with frozen weights - Feature reconstruction loss: Euclidean distance between feature representations - Style reconstruction loss: Frobenius norm of the difference between Gram matrices, performed over a set of layers. - Experiments - Similar objective values and qualitative results as explicit optimization over image as in Gatys et al for style transfer - For single-image super-resolution, feature reconstruction loss reconstructs fine details better and 'looks' better than a per-pixel loss, even though PSNR values indicate otherwise. Respectable results in comparison to SRCNN. ## Weaknesses / Notes - Although fast, limited by styles at test-time (as opposed to iterative optimizer that is limited by speed and not styles). Ideally, there should be a way to feed in style and content images, and do style transfer with a single forward pass. |
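A sketch of the two perceptual losses, assuming `feats_*` are dictionaries of activations from a frozen VGG-16 keyed by layer name (the feature extractor itself is omitted); normalization constants are illustrative.

```python
import torch

def feature_reconstruction_loss(feats_hat, feats_target, layer):
    f1, f2 = feats_hat[layer], feats_target[layer]
    return ((f1 - f2) ** 2).mean()                      # squared Euclidean distance, normalized

def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)          # (b, c, c) Gram matrix

def style_reconstruction_loss(feats_hat, feats_style, layers):
    # Squared Frobenius norm of Gram-matrix differences, summed over a set of layers.
    return sum(((gram(feats_hat[l]) - gram(feats_style[l])) ** 2).sum() for l in layers)
```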
[link]
This paper presents a re-parameterization of the LSTM that successfully applies batch normalization, which results in faster convergence and improved generalization on several sequential tasks. Main contributions:

- Batch normalization is applied to the input-to-hidden and hidden-to-hidden projections.
  - Separate statistics are maintained for each timestep, estimated over each minibatch during training and over the whole dataset at test time.
  - For generalization to longer sequences at test time, the population statistics of time step T\_max are used for all time steps beyond it.
- The cell state is left untouched so as not to hinder the gradient flow.
- Proper initialization of the batch normalization parameters avoids vanishing gradients.
  - They plot the norm of the gradient of the loss wrt the hidden state at different time steps for different BN variance initializations. High variance ($\gamma = 1$) causes gradients to die quickly by driving activations into the saturation region.
  - Initializing the BN variance to 0.1 works well.

## Strengths

- Simple idea, and the authors finally got it to work. Proper initialization of the BN parameters and maintaining separate estimates for each time step play a key role.

## Weaknesses / Notes

- It would be useful in practice to put down a proper formulation for using batch normalization with variable-length training sequences.
[link]
This paper introduces an interpretation of deep residual networks as implicit ensembles of exponentially many shallow networks. For residual block $i$, there are $2^{i-1}$ paths from the input to $i$, and the input to $i$ is a mixture of $2^{i-1}$ different distributions. The interpretation is backed by a number of experiments, such as removing or re-ordering residual blocks at test time and plotting the norm of the gradient v/s the number of residual blocks the gradient signal passes through. Removing $k$ residual blocks (for $k \leq 20$) from a network of depth $n$ decreases the number of paths to $2^{n-k}$, but there are still sufficiently many valid paths to not hurt classification error, whereas sequential CNNs have a single viable path which gets corrupted. A plot of the gradient at the input v/s path length shows that almost all contributions to the gradient come from paths shorter than 20 residual blocks, which are the effective paths. The paper concludes by saying that network 'multiplicity', which is the number of paths, plays a key role in the network's expressivity.

## Strengths

- Extremely insightful set of experiments. These experiments nail down the intuitions as to why residual networks work, as well as clarify the connections with stochastic depth (sampling the network multiplicity during training, i.e. ensemble by training) and highway networks (reduction in the number of available paths by gating both skip connections and paths through residual blocks).

## Weaknesses / Notes

- Connections between effective paths and model compression.
[link]
This paper introduces a modification to the ResNets architecture with multi-level shortcut connections (shortcut from input to pre-final layer as level 1, shortcut over each residual block group as level 2, etc) as opposed to single-level shortcut connections in prior work on ResNets. The authors perform experiments with multi-level shortcut connections on regular ResNets, ResNets with pre-activations and Wide ResNets. Combined with drop-path regularization via stochastic depth and exploration over optimal shortcut level number and optimal depth/width ratio to avoid vanishing gradients and overfitting, this architecture achieves state-of-the-art error rates on CIFAR-10 (3.77%), CIFAR-100 (19.73%) and SVHN (1.59%). ## Strengths - Fairly exhaustive set of experiments over - Shortcut level numbers. - Identity mapping types: 1) zero-padding shortcuts, 2) 1x1 convolutions for projections and others identity, and 3) all 1x1 convolutions. - Residual block size (2 or 3 3x3 convolutional layers). - Depths (110, 164, 182, 218) and widths for both ResNets and Pre-ResNets. |
[link]
This paper introduces an end-to-end trainable neural model capable of performing analogical reasoning in image representations followed by decoding back to image space. Specifically, given a 4-tuple A:B::C:D, the task is to apply the transformation A:B to C. The motivation is clear — humans are excellent at generalizing to hypothetical transformations about images ("what if this chair were rotated 30 degrees clockwise?"). - The objective function follows directly from vector addition: $MSE(d - g(f(b) - f(a) + f(c)))$ where $f$ and $g$ are convolutional neural networks. - In case of rotation, a purely additive transformation is not optimal because repeated application of this transformation to the same query image will never return to the original point. Instead, multiplicative interactions or MLPs are used to condition the transformation on $c$ as well. - Analogy-making is also performed on disentangled representations, which separate factors of variation to separate coordinates and are learnt from distinct images $a,b, c$ such that the objective is $MSE(c - g(s . f(a) + (1-s) . f(b)))$ where $s$ are switch variables to disentangle features. Disentangled image features allow the analogy-making model to traverse the manifold of a given factor or subset of factors. - Experiments on transforming shapes, generating 2D video game sprites and 3D car renderings. ## Strengths - Neat idea, well-presented |
[link]
This paper introduces the task of dense captioning and proposes a network architecture that processes an image and produces region descriptions in a single pass and can be trained end-to-end. Main contributions:

- Dense captioning
  - A generalization of object detection (caption consists of a single word) and image captioning (region consists of the whole image).
- Fully convolutional localization network
  - Fully differentiable, can be trained jointly with the rest of the network.
  - Consists of a region proposal network, box regression (similar to Faster R-CNN) and bilinear interpolation (similar to Spatial Transformer Networks) for sampling.
- Network details
  - Convolutional layer features are extracted for the image.
  - For each element in the feature map, k anchor boxes of different aspect ratios are selected in the input image space.
  - For each of these, the localization layer predicts offsets and confidence.
  - The region proposals are projected onto the convolutional feature map and a sampling grid is computed from the output feature map to the input (bilinear sampling).
  - The computed feature map is passed through an MLP to compute representations corresponding to each region.
  - These are passed (in a batch) as the first word to an LSTM (Show and Tell) which is trained to predict each word of the caption.

## Strengths

- Fully differentiable 'spatial attention' mechanism (bilinear interpolation) in place of RoI pooling as in the case of Faster R-CNN.
  - RoI pooling is not differentiable with respect to the input proposal coordinates.
- Fast, and impressive qualitative results.

## Weaknesses / Notes

The model is very well engineered together from different works (Faster R-CNN + Spatial Transformer Networks + Show & Tell).
[link]
This paper introduces a neural network architecture that generates realistic images sequentially. They also introduce a differentiable attention mechanism that allows the network to focus on local regions of the image during reconstruction. Main contributions: - The network architecture is similar to other variational auto-encoders, except that - The encoder and decoder are recurrent networks (LSTMs). The encoder's output is conditioned on the decoder's previous outputs, and the decoder's outputs are iteratively added to the resulting distribution from which images are generated. - The spatial attention mechanism restricts the input region observed by the encoder and available to write for the decoder. ## Strengths - The spatial soft attention mechanism is effective and fully differentiable, and can be used for other tasks. - Images generated by DRAW look very realistic. ## Weaknesses / Notes |
[link]
This paper introduces an attention mechanism (soft memory access) for the task of neural machine translation. Qualitative and quantitative results show that not only does their model achieve state-of-the-art BLEU scores, it also performs significantly better on long sentences, which were a drawback of earlier NMT works. Their motivation comes from the fact that encoding all the information from an input sentence into a single fixed-length vector and using that in the decoder was probably a bottleneck. Instead, their decoder uses an attention vector, which is a weighted sum of the input hidden states, and is learned jointly. Main contributions:

- The encoder is a bidirectional RNN, in which they take the annotation of each word to be the concatenation of the forward and backward RNN states. The idea is that the hidden state should encode information from both the previous and following words.
- The proposed attention mechanism is a weighted sum of the input hidden states, the weights for which come from an attention function (a single-layer perceptron that takes as input the previous hidden state of the decoder and the current word annotation from the encoder) and are softmax-normalized.

## Strengths

- Incorporating the attention mechanism shows large improvements on longer sentences. The attention matrix is easily interpretable as well, and visualizations in the paper show that higher weights are assigned to input words that correspond to output words irrespective of their order in the sequence (unlike an attention model that uses a mixture of Gaussians, which is monotonic).

## Weaknesses / Notes

- Their model formulation to capture long-term dependencies is far more principled than Sutskever et al.'s idea of inverting the input. They should have done a comparative study with that approach as well though.
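A sketch of the additive attention mechanism described above: a single-layer perceptron scores the previous decoder state against each encoder annotation, the scores are softmax-normalized, and the context is their weighted sum. Module and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim)
        self.U = nn.Linear(enc_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, prev_dec_state, annotations):
        # prev_dec_state: (batch, dec_dim); annotations: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W(prev_dec_state).unsqueeze(1) + self.U(annotations)))
        alphas = torch.softmax(scores.squeeze(-1), dim=1)          # attention weights over source words
        context = (alphas.unsqueeze(-1) * annotations).sum(dim=1)  # weighted sum of annotations
        return context, alphas
```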
[link]
This paper hypothesizes that a CNN trained for scene classification automatically discovers meaningful object detectors, representative of the scene categories, without any explicit object-level supervision. This claim is backed by well-designed experiments which are a natural extension of the primary insight that since scenes are composed of objects (a typical bedroom would have a bed and a lamp; an art gallery would have paintings, etc.), a CNN that performs reasonably well on scene recognition must be localizing objects in intermediate layers.

## Strengths

- Demonstrates the difference in learned representations in Places-CNN and ImageNet-CNN.
  - The top 100 images that have the largest average activation per layer are picked, and it's shown that earlier layers such as pool1 prefer similar images for both networks while deeper layers tend to be more specialized to the specific task of scene or object categorization, i.e. ~75% of the top 100 images that show high activations for fc7 belong to ImageNet for ImageNet-CNN and to Places for Places-CNN.
- Simplifies input images to identify salient regions for classification.
  - The input image is simplified by iteratively removing segments that cause the least decrease in classification score until the image is incorrectly classified. This leads to the minimal image representation (sufficient and necessary) that is needed by the network to correctly recognize scenes, and many of these contain objects that provide discriminative information for scene classification.
- Visualizes the 'empirical receptive fields' of units.
  - The top K images with the highest activations for a given unit are identified. To identify which regions of the image lead to high unit activations, the image is replicated with occluders at different regions. The occluded images are passed through the network and large changes in activation indicate important regions. This leads to feature maps and finally to empirical receptive fields after appropriate centre calibration, which are more localized and smaller than the theoretical size.
- Studies the visual concepts / semantics captured by units.
  - AMT workers are surveyed on the segments that maximally activate units. They're asked to tag the visual concept, mark negative samples and provide the level of abstraction (from simple elements and colors to objects and scenes). A plot of the distribution of semantic categories at each layer shows that deeper layers do capture higher levels of abstraction, and Places-CNN units indeed discover more objects than ImageNet-CNN units.

## Weaknesses / Notes

- Unclear as to how they obtain soft, grayed-out images from the iterative segmentation methodology in the first approach, where they generate minimal image representations needed for accurate classification. I would assume these regions to be segmentations with black backgrounds and hard boundaries. Perez et al. (2013) might have details regarding this.
[link]
This paper introduces a neural network module that can learn input-dependent spatial transformations and can be inserted into any neural network. It supports transformations like scaling, cropping, rotations, and non-rigid deformations. Main contributions:

- The spatial transformer network consists of the following:
  - Localization network that regresses to the transformation parameters given the input.
  - Grid generator that uses the transformation parameters to produce a grid to sample from the input.
  - Sampler that produces the output feature map sampled from the input at the grid points.
- Differentiable sampling mechanism
  - The sampling is written in a way such that sub-gradients can be defined with respect to the grid coordinates.
  - This enables gradients to be propagated through the grid generator and localization network, and lets the network jointly learn the spatial transformer along with the rest of the network.
- A network can have multiple STNs
  - at different points in the network, to model incremental transformations at different levels of abstraction.
  - in parallel, to learn to focus on different regions of interest. For example, on the bird classification task, they show that one STN learns to be a head detector, while the other focuses on the central part of the body.

## Strengths

- Their attention (and by extension transformation) mechanism is differentiable, as opposed to earlier works on non-differentiable attention mechanisms that used reinforcement learning (REINFORCE). It also supports a richer variety of transformations than earlier works on learning transformations, like DRAW.
- State-of-the-art classification performance on distorted MNIST, SVHN, CUB-200-2011.

## Weaknesses / Notes

This is a really nice way to generalize spatial transformations in a differentiable manner so the model can be trained end-to-end. Classification performance, and more importantly, qualitative results of the kind of transformations learnt on larger datasets (like ImageNet) should be evaluated.
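A sketch of a spatial transformer restricted to affine transformations, using PyTorch's built-in grid generation and bilinear sampling; the localization network here is a toy stand-in, initialized to output the identity transform.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        # Toy localization network regressing the 6 affine parameters.
        self.loc = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                 nn.Linear(in_ch * 64, 32), nn.ReLU(), nn.Linear(32, 6))
        # Initialize to the identity transform so training starts from "no warp".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                        # per-example affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
        return F.grid_sample(x, grid, align_corners=False)        # differentiable bilinear sampler
```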
[link]
This paper introduces a Stacked Attention Network (SAN) for visual question answering. SAN uses a multiple-layer attention mechanism that uses the semantic question representation to query the image, locate relevant visual regions, and infer the answer. Details of the SAN model:

- Image features are extracted from the last pooling layer of a deep CNN (like VGG-net).
  - Input images are first scaled to 448 x 448, so at the last pooling layer, features have dimension 14 x 14 x 512, i.e. 512-dimensional vectors at each image location with a receptive field of 32 x 32 in input pixel space.
- Question features are the last hidden state of the LSTM.
  - Words are one-hot encoded, transferred to a vector space by passing through an embedding matrix, and these word vectors are fed into the LSTM at each time step.
- Image and question features are combined into a query vector to locate relevant visual regions.
  - Both the LSTM hidden state and the 512-d image feature vector at each location are transferred to the same dimensionality (say k) by a fully connected layer, added, and passed through a non-linearity (tanh).
  - Each k-dimensional feature vector is then transformed down to a single scalar and a softmax is taken over all image regions to get the attention distribution (say $p_I$).
  - This attention distribution is used to weight the pooling-layer visual features ($\sum_i p_i v_i$), and the result is added to the LSTM vector to get a new query vector.
  - In subsequent attention layers, this updated query vector is used to repeat the same process of getting an attention distribution.
  - The final query vector is used to compute a softmax over the answers.

## Strengths

- The multi-layer attention mechanism makes sense intuitively, and the qualitative results somewhat indicate that going from the first attention layer to subsequent attention layers, the network is able to focus on fine-grained visual regions as it discovers relationships among multiple objects ('what are sitting in the basket on a bicycle').
- SAN benefits VQA; they demonstrate state-of-the-art accuracies on multiple datasets, with question-type breakdown as well.

## Weaknesses / Notes

- Right now, the attention distribution is learnt in an unsupervised manner by the network. It would be interesting to think about adding a supervisory attention signal. Another way to improve accuracies would be to use deeper LSTMs.
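A sketch of a single attention layer as described above; dimensions are illustrative, and the query and region features are assumed to have the same size here so the attended visual vector can be added directly to the query.

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    def __init__(self, img_dim=512, q_dim=512, k=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, k)
        self.q_proj = nn.Linear(q_dim, k)
        self.score = nn.Linear(k, 1)

    def forward(self, v, u):
        # v: (batch, n_regions, img_dim) region features; u: (batch, q_dim) query vector
        h = torch.tanh(self.img_proj(v) + self.q_proj(u).unsqueeze(1))
        p = torch.softmax(self.score(h).squeeze(-1), dim=1)   # attention over image regions
        v_att = (p.unsqueeze(-1) * v).sum(dim=1)              # attention-weighted visual feature
        return u + v_att                                      # refined query for the next layer
```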
[link]
This paper simplifies the convolutional network proposed by Alex Krizhevsky by replacing max-pooling with strided convolutions (under the assumption that max-pooling is required only for dimensionality reduction). They also propose a novel technique for visualizing representations learnt by intermediate layers that produces nicer visualizations in input pixel space than the DeconvNet (Zeiler et al.) and saliency map (Simonyan et al.) approaches.

## Strengths

- Their model performs at par or better than the original AlexNet formulation.
  - Max-pooling replaced by convolution with stride 2
  - Fully-connected layers replaced by 1x1 convolutions and global averaging + softmax
  - Smaller filter size (same intuition as the VGGNet paper)
- Combining the DeconvNet (Zeiler et al.) and backpropagation (Simonyan et al.) approaches at the ReLU operator (which is the only point of difference) by masking out values where at least one of the input activation or output reconstruction is negative (guided backprop) is neat and leads to nice visualizations.

## Weaknesses / Notes

- Saliency maps generated from guided backpropagation definitely look much better compared to DeconvNet visualizations and the saliency maps from Simonyan et al.'s paper. It probably works better because negative saliency values only arise from the very first convolution, since negative error signals are never propagated back through the non-linearities.
[link]
This paper models object detection as a regression problem for bounding boxes and object class probabilities with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid, and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed.

## Strengths

- Works in real time. The base model runs at 45 fps and a faster version goes up to 150 fps, and they claim that it's more than twice as fast as other works on real-time detection.
- End-to-end model; localization and classification errors can be jointly optimized.
- YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN results in a ~3% mAP gain (with little added computation, since YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% v/s 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid cell can be detected.
[link]
This paper reports on a series of experiments with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks. The model achieves very good performance across datasets, and state-of-the-art on a few. The proposed model has an input layer comprising concatenated 'word2vec' embeddings, followed by a single convolutional layer with multiple filters, max-pooling over time, fully connected layers and softmax. They also experiment with static and non-static channels, which refers to whether or not the word2vec embeddings are fine-tuned.

## Strengths

- Very simple yet powerful model formulation, which achieves really good performance across datasets.
- The different model formulations drive home the point that initializing input vectors with word2vec embeddings is better than random initializations. Fine-tuning these embeddings for the task leads to further improvements over static embeddings.

## Weaknesses / Notes

- No intuition as to why the model with both static and non-static channels gives mixed results.
- They briefly mention that they experimented with SENNA embeddings, which led to worse results, although no quantitative results are provided. It would have been interesting to have a comparative study with GloVe embeddings as well.
[link]
This paper attempts to understand the representations learnt by deep convolutional neural networks by introducing two interpretable visualization techniques. Main contributions: - Class model visualizations - These are obtained by making numerical optimizations in the input space to maximize the class score. Gradients are calculated wrt input and are used to update the input image (initialized with zero image), while weights are kept fixed to those obtained from training. - Image-specific saliency map visualizations - These are approximated by using the same gradient as before (gradient of class score wrt input). The absolute pixel-wise max across channels produces the saliency map. - Relation between DeconvNet and optimization-based visualizations - Visualizations using DeconvNet are the same as gradient-based methods except for ReLU. In regular backprop, gradients flow through ReLU to units with positive input activations, whereas in case of a DeconvNet, it is computed on positive output reconstructions. ## Strengths - The visualization techniques are simple ideas and the results are interpretable. They show that the method proposed by Erhan et al. in an unsupervised setting is useful to CNNs trained in a supervised manner as well. - The image-specific class saliency can be interpreted as those pixels which need to be changed the least to have a maximum impact on the classification score. - The relation between DeconvNet visualizations and optimization-based visualizations is insightful. ## Weaknesses / Notes - The thinking behind initializing with zero image and L2 regularization in class model visualizations was missing. |
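A sketch of the image-specific saliency map computation described above: the gradient of the class score with respect to the input, followed by a pixel-wise max of absolute values across channels. `model` is assumed to be any classifier returning unnormalized class scores.

```python
import torch

def saliency_map(model, image, target_class):
    image = image.clone().requires_grad_(True)      # (1, 3, H, W)
    score = model(image)[0, target_class]           # unnormalized class score
    score.backward()                                 # gradient of class score wrt input pixels
    return image.grad.abs().max(dim=1).values[0]    # (H, W) saliency map
```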
[link]
This paper introduces a neural network architecture that is deeper and wider, yet optimized for computational efficiency, by approximating the expected sparse structure (following from Arora et al.'s work) using readily available dense blocks. An ensemble of 7 models (all with the same architecture but different image sampling) achieved the top spot in the classification task at ILSVRC 2014. "Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs." Main contributions:

- A more generalized exploration of the NIN architecture, called the Inception module.
  - 1x1 convolutions to capture dense information clusters
  - 3x3 and 5x5 to capture more spatially spread-out clusters
  - The ratio of 3x3 and 5x5 to 1x1 convolutions increases as we go deeper, as features of higher abstraction are less spatially concentrated.
  - To avoid the blow-up of output channels caused by merging the outputs of convolutional layers and the pooling layer, they use 1x1 convolutions for dimensionality reduction. This has the added benefit of another layer of non-linearity (and thus increased discriminative capability).
- Multiple intermediate layers are tied to the objective function. Since features produced by intermediate layers of a deep network are supposed to be very discriminative, and to strengthen the gradient signal passing through them during back-propagation, they attach auxiliary classifiers to intermediate layers.
  - During training, they do a weighted sum of this loss with the total loss of the network.
  - At test time, these auxiliary networks are discarded.
  - Auxiliary classifier architecture: average pooling, 1x1 convolution (for dimensionality reduction), dropout, linear layer with softmax.

## Strengths

- Excellent results on ILSVRC 2014.

## Weaknesses / Notes

- Even though the authors try to explain some of the intuition, most of the design decisions seem arbitrary.
[link]
This paper studies the transferability of features learnt at different layers of a convolutional neural network. Typically, the initial layers of a CNN learn features that resemble Gabor filters or color blobs and are fairly general, while the later layers are more task-specific. Main contributions:

- They create two splits of the ImageNet dataset (A/B) and explore how performance varies for various network design choices such as:
  - Base: CNN trained on A or B.
  - Selffer: first n layers are copied from a base network, and the rest of the network is randomly initialized and trained on the same task.
  - Transfer: first n layers are copied from a base network, and the rest of the network is trained on a different task.
  - Each of these 'copied' layers can either be fine-tuned or kept frozen.
- Selffer networks without fine-tuning don't perform well when the split is somewhere in the middle of the network (n = 3-6). This is because neurons in these layers co-adapt to each other's activations in complex ways, which get broken up when split.
  - As we approach the final layers, there is less for the network to learn and so these layers can be trained independently.
  - Fine-tuning a selffer network gives it the chance to re-learn co-adaptations.
- Transfer networks transferred at lower n perform better than at larger n, indicating that features get more task-specific as we move to higher layers.
  - Fine-tuning transfer networks, however, results in better performance. They argue that the better generalization is due to the effect of having seen the base dataset, even after considerable fine-tuning.
- Fine-tuning works much better than using random features.
- Features are more transferable across related tasks than unrelated tasks.
  - They study transferability by taking two random data splits, and splits of man-made v/s natural data.

## Strengths

- Experiments are thorough, and the results are intuitive and insightful.

## Weaknesses / Notes

- This paper only analyzes transferability across different splits of ImageNet (as similar/dissimilar tasks). They should have reported results on transferability from one task to another (classification/detection) or from one dataset to another (ImageNet/MS-COCO).
- It would be interesting to study the role of dropout in preventing co-adaptations while transferring features.
[link]
The paper introduces two key properties of deep neural networks: - Semantic meaning of individual units. - Earlier works analyzed learnt semantics by finding images that maximally activate individual units. - Authors observe that there is no difference between individual units and random linear combinations of units. - It is the entire space of activations that contains the bulk of semantic information. - Stability of neural networks to small perturbations in input space. - Networks that generalize well are expected to be robust to small perturbations in the input, i.e. imperceptible noise in the input shouldn't change the predicted class. - Authors find that networks can be made to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. - These 'adversarial examples' generalize well to different architectures trained on different data subsets. ## Strengths - The authors propose a way to make networks more robust to small perturbations by training them with adversarial examples in an adaptive manner, i.e. keep changing the pool of adversarial examples during training. In this regard, they draw a connection with hard-negative mining, and a network trained with adversarial examples performs better than others. - Formal description of how to generate adversarial examples and mathematical analysis of a network's stability to perturbations are useful studies. ## Weaknesses / Notes - Two images that are visually indistinguishable to humans but classified differently by the network is indeed an intriguing observation. - The paper feels a little half-baked in parts, and some ideas could've been presented more clearly. |
[link]
This paper introduces the Places dataset, a scene-centric dataset at the scale of ImageNet (which is object-centric), so as to enable training of deep CNNs like AlexNet, and achieves state-of-the-art results on scene benchmarks. Main contributions: - Collects a dataset at ImageNet scale for scene recognition. - Achieves state-of-the-art on scene benchmarks: SUN397, MIT Indoor67, Scene15, SUN Attribute. - Introduces measures for comparing datasets: density and diversity. - Makes a thorough comparison b/w ImageNet and Places, from dataset statistics to classification results to visualizations of learned representations.
## Strengths
- Relative density and diversity are neat ideas for comparing datasets, and are backed by AMT experiments (see the sketch below). - Relative density: The more visually similar a nearest neighbour is to a randomly sampled image from a dataset, the denser the dataset is. - Relative diversity: The more visually similar two randomly sampled images from a dataset are, the less diverse it is. - Demonstrates via activation and mean-image visualizations that different representations are learned by CNNs trained on ImageNet and Places. - Conv1 layer visualizations can be directly seen, and are similar for ImageNet-CNN and Places-CNN. They capture low-level information like oriented edges and colors. - For higher layers, they visualize the average of the top 100 images that maximize activations per unit. As we go deeper, ImageNet-CNN units have receptive fields that look more like object blobs and Places-CNN units have RFs that look more like landscapes with spatial structures.
## Weaknesses / Notes
- No explanation as to why the model trained on ImageNet and Places combined (minus overlapping images) performs better than ImageNet-CNN or Places-CNN on some benchmarks and worse on others. |
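A rough NumPy sketch of the density/diversity intuition above. In the paper, visual similarity is judged by AMT workers and the measures are relative comparisons between two datasets; here similarity is approximated by cosine similarity between (hypothetical) image feature vectors and each function returns a per-dataset score, so this is only a proxy for the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def diversity_proxy(features, n_pairs=1000):
    """Lower mean similarity of random image pairs => more diverse dataset."""
    idx = rng.integers(0, len(features), size=(n_pairs, 2))
    sims = [cosine(features[i], features[j]) for i, j in idx]
    return 1.0 - float(np.mean(sims))

def density_proxy(features, n_queries=200):
    """Higher similarity of a random image to its nearest neighbour => denser dataset."""
    queries = rng.integers(0, len(features), size=n_queries)
    sims = []
    for i in queries:
        sims.append(max(cosine(features[i], features[j])
                        for j in range(len(features)) if j != i))
    return float(np.mean(sims))

# feats_imagenet, feats_places: hypothetical (N, D) arrays of image descriptors
# print(diversity_proxy(feats_imagenet), diversity_proxy(feats_places))
```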
[link]
This paper studies a very natural generalization of convolutional layers by replacing the single filter that slides over the input feature map with a "micro network" (multi-layer perceptron). The authors argue that good abstractions are highly non-linear functions of the input data, and that instead of generating an overcomplete number of feature maps and shrinking them down in higher layers (as is the case in traditional CNNs), it would be beneficial to generate better representations on each local patch before feeding into the next layer. Main contributions: - Replaces the convolutional filter with a multi-layer perceptron (see the sketch below). - Instead of fully connected layers, uses global average pooling.
## Strengths
- Natural generalization of convolutional layers and thorough analysis. - Global average pooling of feature layers is easier to interpret and less prone to overfitting. - Better than or on par with state-of-the-art classification results on CIFAR-10, CIFAR-100, SVHN, MNIST.
## Weaknesses / Notes
- Should have explored NIN without dropout. - Results on ImageNet missing. - The global average pooling idea, although interpretable, doesn't seem to lend itself easily to fine-tuning the network on other datasets. In fine-tuning, we usually replace and learn just the last layer. |
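Applying a small MLP to every local patch is equivalent to a normal convolution followed by 1x1 convolutions, which is how an mlpconv block is usually implemented. A minimal PyTorch sketch with illustrative channel sizes and depths (not the paper's exact configuration):

```python
import torch
import torch.nn as nn

def mlpconv(in_ch, mid_ch, out_ch, kernel_size):
    """One NIN block: a spatial convolution followed by two 1x1 convolutions,
    i.e. a small MLP applied to every local patch."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size, padding=kernel_size // 2), nn.ReLU(),
        nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(),
        nn.Conv2d(mid_ch, out_ch, 1), nn.ReLU(),
    )

# Tiny NIN-style classifier (channel sizes are illustrative):
net = nn.Sequential(
    mlpconv(3, 96, 96, 5),
    nn.MaxPool2d(2),
    mlpconv(96, 192, 10, 5),        # last block emits one feature map per class
    nn.AdaptiveAvgPool2d(1),        # global average pooling instead of FC layers
    nn.Flatten(),                   # (N, 10) class scores, fed to a softmax downstream
)
logits = net(torch.randn(2, 3, 32, 32))   # -> shape (2, 10)
```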
[link]
Neural Turing Machine (NTM) consists of a neural network controller interacting with a working memory bank in a learnable manner. This is analogous to a computer — the controller is the CPU (with hidden activations as registers) and the memory matrix is the RAM. Key ideas: - Controller (modified RNN) interacts with the external world via input and output vectors, and with memory via read and write "heads" - The "read" vector is a convex combination of the row-vectors of $M\_t$ (memory matrix at time $t$) — $r\_t = \sum\_i w\_t(i) M\_t(i)$, where $w\_t$ is a vector of weightings over the N memory locations - "Writing" is decomposed into 1) erasing and 2) adding - The write head produces an erase vector $e\_t$ and an add vector $a\_t$ along with the vector of weightings over memory locations $w\_t$ - $M\_t(i) = M\_{t-1}(i)[1 - w\_t(i) e\_t] + w\_t(i) a\_t$ (elementwise over each memory row) - The erase and add vectors control which components of memory are updated, while the weightings $w\_t$ control which locations are updated - Weight vectors are produced by an addressing mechanism (see the read/write sketch below) - Content-based addressing - Each head produces a length-M key $k\_t$ that is compared to each vector $M\_t(i)$ by cosine similarity, scaled by a key-strength (temperature-like) parameter $\beta\_t$. The weightings are normalized (softmax). - Location-based addressing - Interpolation: Each head produces an interpolation gate $g\_t$ that is used to blend between the weighting at the previous timestep and the content weighting of the current timestep: $w^{g}\_t = g\_t w^{c}\_t + (1-g\_t)w\_{t-1}$ - Shift: Circular convolution (modulo N) with a shift weighting distribution, for example a softmax over integer shift positions (say 3 locations) - Sharpening: Each head emits $\gamma\_t$ to sharpen the final weighting - Experiments on copy, repeat-copy, associative memory, N-gram emulator and priority sort
## Links
- [Attention and Augmented RNNs](http://distill.pub/2016/augmented-rnns/) - [NTM-Lasagne](https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315) |
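A minimal NumPy sketch of content-based addressing plus the read and erase/add write operations. Location-based addressing (interpolation, shifting, sharpening) is omitted, and the shapes and toy usage at the bottom are illustrative; in the NTM the key, erase and add vectors are emitted by the controller.

```python
import numpy as np

def content_addressing(M, k, beta):
    """Cosine-similarity focus over the N memory rows, sharpened by key strength beta."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    e = np.exp(beta * sims)
    return e / e.sum()                        # w_t: weighting over memory locations

def read(M, w):
    return w @ M                              # r_t = sum_i w_t(i) M_t(i)

def write(M, w, erase, add):
    """Erase then add, both gated by the location weighting w."""
    M = M * (1 - np.outer(w, erase))          # M~_t(i) = M_{t-1}(i)[1 - w_t(i) e_t]
    return M + np.outer(w, add)               # M_t(i)  = M~_t(i) + w_t(i) a_t

# Toy usage: N=8 locations with 4-dimensional rows.
M = np.random.randn(8, 4)
k, erase, add = np.random.randn(4), np.random.rand(4), np.random.randn(4)
w = content_addressing(M, k, beta=5.0)
r = read(M, w)
M = write(M, w, erase, add)
```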
[link]
This paper presents R-CNN, an approach to do object detection using CNNs pre-trained for image classification. Object proposals are extracted from the image using Selective Search, dilated by a few pixels, warped to the CNN input size and fed into the CNN to extract features (they experiment with pool5, fc6, fc7). These extracted feature vectors are scored using SVMs, one per class. Bounding box regression, where they predict parameters to move the proposal closer to the ground truth, further boosts localization. The authors use AlexNet, pre-trained on ImageNet and fine-tuned for detection. Object proposals with IoU overlap of at least 0.5 with a ground-truth box are treated as positive examples, and others as negative, and a 21-way classification (20 object categories + background) is set up to fine-tune the CNN (see the IoU sketch below). After fine-tuning, SVMs are trained per class, taking only the ground-truth boxes as positives, and IoU <= 0.3 as negatives. R-CNN achieves major performance improvements on the PASCAL VOC 2007/2010 and ILSVRC2013 detection datasets. Finally, the method is extended to semantic segmentation and achieves competitive results.
## Strengths
- The method is simple and effective. - Extensive ablation studies show why R-CNN works. - FC7 is the best feature to use (against pool5, fc6). - Fine-tuning provides a large boost in performance. - VGG performs better than AlexNet. - Bounding box regression further improves localization.
## Weaknesses / Notes
- Each region proposal is processed independently, which drives up compute time. - There are lots of different parts; the network can't be trained end-to-end. |
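A small sketch of the IoU computation and the fine-tuning labeling rule summarized above. Boxes are assumed to be `[x1, y1, x2, y2]` lists, and the helper names are illustrative rather than taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def finetune_label(proposal, gt_boxes, gt_classes, background=0):
    """Fine-tuning rule from the summary: IoU >= 0.5 with some ground-truth box
    => that object's class; otherwise background."""
    best = max(range(len(gt_boxes)), key=lambda i: iou(proposal, gt_boxes[i]), default=None)
    if best is not None and iou(proposal, gt_boxes[best]) >= 0.5:
        return gt_classes[best]
    return background

# e.g. finetune_label([10, 10, 50, 60], [[12, 8, 48, 58]], [3])  ->  3
```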
[link]
This paper presents a simple approach to predicting sequences from sequential input. They use a multi-layer LSTM-based encoder-decoder architecture and show promising results on the task of neural machine translation. Their approach beats a phrase-based statistical machine translation baseline by more than 1 BLEU point, and is close to the state-of-the-art when used to re-rank the 1000-best predictions from the SMT system. Main contributions: - The first LSTM encodes an input sequence into a single vector, which is then decoded by a second LSTM (see the sketch below). End of sequence is indicated by a special character. - 4-layer deep LSTMs. - 160k source vocabulary, 80k target vocabulary. Trained on 12M sentences. Words in the output sequence are generated by a softmax over the fixed vocabulary. - Beam search is used at test time to predict translations (beam size 2 does best).
## Strengths
- Qualitative results (PCA projections) show that the learned representations are fairly insensitive to active/passive voice, as sentences similar in meaning are clustered together. - Another interesting observation is that reversing the source sequence gives a significant boost to the translation of long sentences, most likely due to the introduction of short-term dependencies that are more easily captured by the gradients.
## Weaknesses / Notes
- The idea of reversing the source input needs better justification; otherwise it comes across as an 'ugly hack'. - To re-score the n-best list of predictions of the baseline, they average the confidences of the LSTM and the baseline model. They should have reported re-ranking accuracies using just the LSTM-model confidences. |
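A minimal PyTorch sketch of the encode-to-one-vector / decode scheme, with the source reversal folded in. It uses a single LSTM layer and tiny illustrative dimensions and vocabularies, not the paper's 4-layer setup, and omits beam search.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)     # softmax over the fixed target vocabulary

    def forward(self, src, tgt_in):
        # Reverse the source sequence, as the paper does, to shorten early dependencies.
        _, state = self.encoder(self.src_emb(src.flip(1)))
        # Decoder is initialized from the single encoding of the source sentence.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                 # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=800)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 800, (2, 5)))
```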
[link]
This paper proposes a modified convolutional network architecture with increased depth, smaller filters, data augmentation and a bunch of engineering tricks; an ensemble of such networks achieves second place in the classification task and first place in the localization task at ILSVRC2014. Main contributions: - Experiments with architectures of different depths, from 11 to 19 weight layers. - Changes in architecture - Smaller convolution filters - 1x1 convolutions: a linear transformation of the input channels followed by a non-linearity, which increases the discriminative capability of the decision function. - Varying image scales - During training, the image is rescaled to set the length of the shortest side to S and then 224x224 crops are taken. - Fixed S: S=256 and S=384 - Multi-scale: S randomly sampled from [256,512] - This can be interpreted as a kind of data augmentation by scale jittering, where a single model is trained to recognize objects over a wide range of scales. - Single-scale evaluation: At test time, Q=S for fixed S and Q=0.5(S_min + S_max) for jittered S. - Multi-scale evaluation: At test time, Q={S-32,S,S+32} for fixed S and Q={S_min, 0.5(S_min + S_max), S_max} for jittered S. Resulting class posteriors are averaged. This performs the best. - Dense v/s multi-crop evaluation - In dense evaluation, the fully connected layers are converted to convolutional layers at test time (see the sketch below), and the uncropped image is passed through the fully convolutional net to get dense class scores. Scores are averaged for the uncropped image and its flip to obtain the final fixed-width class posteriors. - This is compared against taking multiple crops of the test image and averaging scores obtained by passing each of these through the CNN. - Multi-crop evaluation works slightly better than dense evaluation, but the methods are somewhat complementary, as averaging scores from both does better than either of them individually. The authors hypothesize that this is probably because of the different boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
## Strengths
- Thoughtful design of network architectures and experiments to study the effect of depth, LRN, 1x1 convolutions, pre-initialization of weights, image scales, and dense v/s multi-crop evaluations.
## Weaknesses / Notes
- No analysis of how much time these networks take to train. - It is interesting how the authors trained the deeper models (D, E) by initializing the initial and final layer parameters with those from a shallower model (A). - It would be interesting to visualize and compare the representations learnt by three stacked 3x3 conv layers versus one 7x7 conv layer, and maybe compare their receptive fields. - They mention that performance saturates with depth while going from D to E, but there should have been a more formal characterization of why that happens (is deeper always better?). - The ensemble consists of just 2 nets, yet performs really well. |
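A hedged PyTorch sketch of the FC-to-conv conversion behind dense evaluation: a fully connected layer expecting a fixed-size feature map is recast as a convolution whose kernel covers that whole map, so larger inputs yield a spatial map of class scores. The sizes (512 channels, 7x7) are illustrative, not exactly VGG's.

```python
import torch
import torch.nn as nn

# A trained classifier head: fc expects a fixed 512 x 7 x 7 input.
fc = nn.Linear(512 * 7 * 7, 4096)

# Equivalent convolution: a 7x7 kernel over 512 channels producing 4096 maps.
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(4096, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

x_small = torch.randn(1, 512, 7, 7)      # cropped input  -> 1x1 score map
x_large = torch.randn(1, 512, 11, 11)    # uncropped, larger input -> 5x5 score map
assert torch.allclose(conv(x_small).flatten(), fc(x_small.flatten(1)).flatten(), atol=1e-4)
dense_scores = conv(x_large)             # (1, 4096, 5, 5); average spatially for class posteriors
```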
[link]
This paper introduces DeconvNet, a novel visualization technique to understand the representations learnt by intermediate layers of a deep convolutional neural network. Using DeconvNet visualizations as a diagnostic tool in different settings, the authors propose changes to the model proposed by Alex Krizhevsky, which perform slightly better and generalize well to other datasets. Key contributions: - Deconvolutional network - Feature activations are mapped back to input pixel space by setting the other activations in the layer to zero and successively unpooling, rectifying and filtering (using the same parameters). - Unpooling is approximated by using switch variables to remember the location of the highest input activation (and hence these visualizations are image-specific); see the sketch below. - Rectification involves passing the signal through a ReLU non-linearity. - Filtering involves convolving the reconstructed signal with the transpose of the convolutional layer filters. - Well-designed experiments to provide insights
## Strengths
- Observation of the evolution of features - Visualizations clearly demonstrate that lower layers converge within a few epochs and upper layers develop only after a considerable number of epochs (40-50). - Feature invariance - Visualizations show that small transformations have a dramatic effect on lower layers and a lesser impact on higher layers. The model is fairly stable to translation and scaling, not so much to rotation. - Occlusion sensitivity analysis - Parts of the image are occluded, and the class posterior and feature activations are visualized. These clearly show that activations drop when the object is occluded. - Correspondence analysis - The intuition is that CNNs implicitly learn the correspondence between different parts. - To verify this, dog images with frontal pose are taken and the same part of the face is occluded in each of them. Then the difference in feature maps between each of those and the original image is calculated, and the consistency of this difference across all image pairs is measured via Hamming distance. Lower scores compared to random occlusions show that the model does learn correspondences. - The proposed model performs better than Alex Krizhevsky's model, and generalizes well to other datasets.
## Weaknesses / Notes
- The justification / intuition for the choice of smaller filters wasn't convincing enough. - Why does removing layer 7 give a better top-1 error rate on train and val? - Rotation invariance might be something worth looking into. |
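A minimal PyTorch sketch of one unpool-rectify-filter step using switch variables, via `MaxPool2d(return_indices=True)` and `MaxUnpool2d`. It covers a single layer only, and the step of zeroing all activations except the one being visualized is omitted for brevity; shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 8, 3, padding=1)
pool = nn.MaxPool2d(2, return_indices=True)    # remember where each max came from
unpool = nn.MaxUnpool2d(2)

x = torch.randn(1, 3, 32, 32)
a = F.relu(conv(x))
p, switches = pool(a)                          # switches = argmax locations per pooling window

# Project activations back towards pixel space:
# unpool with the recorded switches, rectify, then filter with the transposed weights.
r = unpool(p, switches, output_size=a.shape)
r = F.relu(r)
recon = F.conv_transpose2d(r, conv.weight, padding=1)   # back to (1, 3, 32, 32)
```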
[link]
This paper introduces a deep convolutional neural network (CNN) architecture that achieved record-breaking performance in the 2012 ImageNet LSVRC. Notably, it brings together a bunch of neat ideas in an end-to-end, trainable model. Main contributions: - Achieves state-of-the-art performance in ILSVRC-2012. - Makes available an efficient, parallelized GPU implementation of the model. - Describes in detail the features of the model that help in improving performance and reducing training time, along with extensive ablative studies. - Uses data augmentation and dropout to prevent overfitting.
## Strengths
- Uses (and popularizes) ReLUs instead of tanh as the non-linear activation unit, which makes training six times faster. - Uses local response normalization and overlapped pooling. - Data augmentation - Extracts random crops and horizontal reflections, and performs image translations, all of which preserve the label distribution. - Alters RGB pixel values by performing PCA on the RGB values of the training set, and adding to each image multiples of the principal components scaled by their eigenvalues times a random variable drawn from a Gaussian (see the sketch below). This provides invariance to changes in the intensity and color of the illumination. - Dropout prevents overfitting. Randomly drops half of the neurons in the fully connected layers, and can be interpreted as averaging over exponentially-many dropout networks.
## Weaknesses / Notes
- Lacks theoretical insight. Design decisions are motivated solely by results. |
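A NumPy sketch of the PCA colour augmentation described above. `sigma=0.1` matches the standard deviation reported in the paper; the image shapes and function names are illustrative, and the PCA is fit on a stand-in batch rather than the full training set.

```python
import numpy as np

def fit_rgb_pca(images):
    """images: (N, H, W, 3) float array. PCA over the set of all RGB pixel values."""
    pixels = images.reshape(-1, 3)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
    return eigvals, eigvecs                       # columns of eigvecs are principal components

def pca_color_augment(image, eigvals, eigvecs, sigma=0.1, rng=None):
    """Add p_i * (lambda_i * alpha_i) to every pixel, with alpha_i ~ N(0, sigma^2)
    drawn once per image."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (eigvals * alpha)           # a single RGB offset for the whole image
    return image + shift

# imgs = np.random.rand(16, 32, 32, 3)            # stand-in for training images
# vals, vecs = fit_rgb_pca(imgs)
# aug = pca_color_augment(imgs[0], vals, vecs)
```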
[link]
- Turing, in his 1950 MIND paper, proposed an operational, behavioral alternative to the philosophical question "Can machines think?" by suggesting a simple "Turing test" in which machines play the "imitation game" and humans are tasked with discerning machine from human given their responses. He believed even partial success towards this goal, given only 5 minutes of interaction, would be hard and far off. - The Turing test hasn't yet been met (except in restricted settings like Siri, Watson), but his prediction that "one will be able to speak of machines thinking without expecting to be contradicted" has proved true — "smart" computers have become commonplace. - One of the reasons the Turing test hasn't been met yet is the kind of failures today's intelligent systems make. Their capabilities are limited in the types of questions they can handle, the domains they cover, and their ability to handle unexpected input. Failure cases where "it doesn't know that it doesn't know" make humans exclaim how stupid the system is. - There is a realization that computers and humans have separate strengths, weaknesses and roles. Also, language is inherently social and connected to communicative purpose and human cooperation. It is intentional behavior, not just stimulus-response. Language also assumes that participants have models of each other, models that influence what they say and how they say it. Retrospectively speaking, Turing's imitation game misses these aspects. "Jeopardy" was clever in avoiding dialogue context and the modeling of other people's behavior. - Another big change: instead of input-output interactions between humans and computers, today humans + computers exist in "mixed" networks. - Desirable properties in today's Turing test: interactive nature + use of language in real settings (rather than success in a game) + human-machine collaboration. - Proposed Turing test: "Is it imaginable that a computer (agent) team member could behave, over the long term and in uncertain, dynamic environments, in such a way that people on the team will not notice it is not human." - This doesn't ask the machine to appear human, act human or be mistaken for one, but its non-humanness shouldn't hit people in the face. Its behavior shouldn't baffle teammates, leaving them wondering not about what it is thinking but whether it is. Such a system will also need a model of its teammates' knowledge, abilities, preferences, etc. |
[link]
Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md). This paper introduces a novel representation of actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect). - Model - The model utilizes a Siamese architecture, with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect), which are aggregated by average pooling and followed by a fully-connected layer. - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to lie in [1/3t, 1/2t] and [1/2t, 2/3t] respectively, and estimated via brute-force search during training. - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices. - The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the correctly transformed precondition embedding, and 2) push the similarity under incorrect transformations below a chosen margin (see the loss sketch below). - ACT Dataset - 50 keywords, 43 classes, ~500 YouTube videos per keyword. - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"? - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories, which are the same action under different subjects, objects and scenes. - Experiments - Action recognition on UCF101, HMDB51, ACT. - Cross-category generalization on ACT. - Visualizations - Nearest neighbor: modeling actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color. - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than on context. - Embedding retrievals based on transformed precondition embeddings.
## Thoughts
- Modeling an action as a transformation from precondition to effect is a very neat idea. - The exact formulation and supporting experiments and ablation studies are thorough. - During inference, the model first extracts features for all frames and then does a brute-force search over (y, z\_p, z\_e) to estimate the action category and the segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism over z might be feasible and would reduce computation to a single forward pass. |
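A hedged PyTorch sketch of the per-action transformation matrices with a margin-based loss. The embedding dimension, margin value and exact hinge form are illustrative choices to convey the idea, not the paper's precise objective; in the paper the embeddings come from the two Siamese heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions, dim = 43, 128                        # ACT has 43 classes; dim is illustrative
T = nn.Parameter(torch.randn(n_actions, dim, dim) * 0.01)   # one transformation per action

def transformation_loss(pre_emb, eff_emb, action, margin=0.5):
    """pre_emb, eff_emb: (dim,) precondition / effect embeddings.
    Pull the correctly transformed precondition towards the effect; push the other
    transformations away whenever their similarity exceeds the margin."""
    transformed = T @ pre_emb                   # (n_actions, dim): every action's prediction
    sims = F.cosine_similarity(transformed, eff_emb.unsqueeze(0), dim=1)   # (n_actions,)
    pos = 1.0 - sims[action]                    # maximize similarity for the true action
    mask = torch.ones(n_actions, dtype=torch.bool)
    mask[action] = False
    neg = F.relu(sims[mask] - margin).sum()     # hinge on the incorrect transformations
    return pos + neg

loss = transformation_loss(torch.randn(dim), torch.randn(dim), action=7)
```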