yenchenlin's profile - ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

Generating Images with Perceptual Similarity Metrics based on Deep Networks
Alexey Dosovitskiy and Thomas Brox
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.CV, cs.NE
more

[link] Summary by yenchenlin 8 years ago

This paper proposed a class of loss functions applicable to image generation that are based on distance in feature spaces:

$$\mathcal{L} = \lambda_{feat}\mathcal{L}_{feat} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{img}\mathcal{L}_{img}$$

### Key Points
- Using only l2 loss in image space yields over-smoothed results since it leads to averaging all likely locations of details.
- L_feat measures the distance in suitable feature space and therefore preserves distribution of fine details instead of exact locations.
- Using only L_feat yields bad results since feature representations are contractive. Many non-natural images also mapped to the same feature vector.
- By introducing a natural image prior - GAN, we can make sure that samples lie on the natural image manifold.

### Model

https://i.imgur.com/qNzMwQ6.png

### Exp
- Training Autoencoder
- Generate images using VAE
- Invert feature

### Thought
I think the experiment section is a little complicated to comprehend. However, the proposed loss seems really promising and can be applied to many tasks related to image generation.

### Questions
- Section 4.2 & 4.3 are hard to follow for me, need to pay more attention in the future

papers.nips.cc
scholar.google.com

Synthesizing the preferred inputs for neurons in neural networks via deep generator networks
Nguyen, Anh and Dosovitskiy, Alexey and Yosinski, Jason and Brox, Thomas and Clune, Jeff
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by yenchenlin 8 years ago

This paper performs activation maximization (AM) using Deep Generator Network (DGN), which served as a learned natural iamge prior, to synthesize realistic images as inputs and feed it into the DNN we want to understand.
By visualizing synthesized images that highly activate particular neurons in the DNN, we can interpret what each of neurons in the DNN learned to detect.

### Key Points

- DGN (natural image prior) generates more coherent images when optimizing fully-connected layer codes instead of low-level codes. However, previous studies showed that low-level features results in better reconstructions beacuse it contains more image details. The difference is that here DGN-AM is trying to synthesize an entire layer code from scratch. Features in low-level only has a small, local receptive field so that the optimization process has to independently tune image without knowing the global structure. Also, the code space at a convolutional layer is much more high-dimensional, making it harder to optimize.

- The learned prior trained on ImageNet can also generalize to Places.
- It doesn't generalize well if architecture of the encoder trained with DGN is different with the DNN we wish to inspect.
- The learned prior also generalizes to visualize hidden neurons, producing more realistic textures/colors.
- When visualizing hidden neurons, DGN-AM trained on ImageNet also generalize to Places and produce similar results as [1].
- The synthesized images are showed to teach us what neurons in DNN we wish to inspect prefer instead of what prior prefer.

### Model

![](https://cloud.githubusercontent.com/assets/7057863/21002626/b094d7ae-bd61-11e6-8c95-fd4931648426.png)

### Thought
Solid paper with diverse visualizations and thorough analysis.

### Reference
[1] Object Detectors Emerge In Deep Scene CNNs, B.Zhou et. al.

papers.nips.cc
scholar.google.com

SoundNet: Learning Sound Representations from Unlabeled Video
Aytar, Yusuf and Vondrick, Carl and Torralba, Antonio
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by yenchenlin 8 years ago

This paper developed a semantically rich representation for natural sound using unlabeled videos as a bridge to
transfer discriminative visual knowledge from well-established visual recognition models into the sound modality.
The learned sound representation yields significant performance improvements on standard benchmarks for acoustic
scene classification task.

### Key Points

- The natural synchronization between vision and sound can be leveraged as a supervision signal for each other.
- Cross-modal learning can overcome overfitting if the target modal have much fewer data than other modals, which is essential for deep networks to work well.
- In the sound classification task, **pool5** and **conv6** extracted from SoundNet achieve best performance.

### Model
- The authors proposed a student-teacher training procedure to transfer discriminative visual knowledge from visual recognition models
trained on ImageNet and Places into the SoundNet by minimizing KL divergence between their predictions.
![](https://cloud.githubusercontent.com/assets/7057863/20856609/05fe12d6-b94e-11e6-8c92-995ee84fe0d7.png)
- Two reasons to use CNN for sound: 1. invariant to translations; 2. stacking layers to detect higher-level concepts.

### Exp

- Adding a linear SVM upon representation learned from SoundNet outperforms other existing methods 10%.
- Using lots of unlabeled videos as supervision signals enable the deeper SoundNet to work, or otherwise the 8-layer networks
performs poorly due to overfitting.
- Simultaneous Using Places and ImageNet as supervision beats using only one of them 3%.
- Multi-modal recognition models use visual and sound data together yields 2% gain in classification accuracy.

### Thought
I think this paper is really complete since it contains good intuition, ablation analysis, representation visualization, hidden unit visualization, and significent performance imporvements.

### Questions
- Although paper said that "To handle variable-temporal-length of input sound, this model uses a fully convolutional network and produces an output over multiple timesteps in video.", but the code seems to set the length of each excerpts fixed to 5 seconds.
- It looks not clear for me about the data augmentation technique used in training.

proceedings.mlr.press
scholar.google.com

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Finn, Chelsea and Abbeel, Pieter and Levine, Sergey
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by yenchenlin 8 years ago

The authors propose an algorithm for meta-learning that is compatible with any model trained with gradient descent, and show that it works on various domain including supervised learning and reinforcement learning. This is done by explicitly train the network such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task.

### Key Points

- MAML is actually finding a good **initialization** of model parameters for several tasks.
- Good initialization of parameters means that it can achieve good performance on several tasks with small number of gradient steps.

### Method
- Simultaneously optimize the **initialization** of model parameters of different meta-training tasks, hoping that it can quickly adapt to new meta-testing tasks.

![](https://cloud.githubusercontent.com/assets/7057863/25161911/46f2721e-24f1-11e7-9fba-8bc2f0782204.png)

- Training procedure:

![](https://cloud.githubusercontent.com/assets/7057863/25161749/8d00902a-24f0-11e7-93a8-6a9b74386f55.png)



### Exp

- It acheived performance that is comparable to the state-of-the-art on classification/regression/reinforcement learning tasks.

### Thought
I think the experiments are thorough since they proved that this technique can be applied to both supervised and reinforcement learning. However, the method is not novel provided that [Optimization a A Midel For Few-shot Learning](https://openreview.net/pdf?id=rJY0-Kcll) already proposed to learn initialization of parameters.

yenchenlin

sciscore: 2