Martin Thoma's profile - ShortScience.org

arxiv.org
scholar.google.com

Metadata Embeddings for User and Item Cold-start Recommendations
Kula, Maciej
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 5 years ago

The idea is to combine collaborative filtering with content-based recommenders to mitigate the user and item coldstart problems.

The author distinguishes between positive and negative interactions.

The representation of a user and of items is the sum of all their latent representations. This sounds similar to "**Asymmetric factor models**" as described in [the BellKor Netflix price solution](https://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf). **The key idea is to encode the latent user (or item) vector as a sum of latent attribute vectors.**

Adagrad / asynchronous stochastic gradient descent was used for optimization.


## See also

* [Code on GitHub](https://lyst.github.io/lightfm/docs/index.html#)
* [Paper on ArXiv](https://arxiv.org/pdf/1507.08439.pdf)

scholar.google.com

Collaborative Filtering for Implicit Feedback Datasets
Hu, Yifan and Koren, Yehuda and Volinsky, Chris
International Conference on Data Mining - 2008 via Local Bibsonomy
Keywords: collaborativfiltering, alternaterootsquare

[link] Summary by Martin Thoma 5 years ago

This paper is about a recommendation system approach using collaborative filtering (CF) on implicit feedback datasets.

The core of it is the minimization problem

$$\min_{x_*, y_*} \sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

with

* $\lambda \in [0, \infty[$ is a hyper parameter which defines how strong the model is regularized
* $u$ denoting a user, $u_*$ are all user factors $x_u$ combined
* $i$ denoting an item, $y_*$ are all item factors $y_i$ combined
* $x_u \in \mathbb{R}^n$ is the latent user factor (embedding); $n$ is another hyper parameter. $n=50$ seems to be a reasonable choice.
* $y_i \in \mathbb{R}^n$ is the latent item factor (embedding)
* $r_{ui}$ defines the "intensity"; higher values mean user $u$ interacted more with item $i$
* $p_{ui} = \begin{cases}1 & \text{if } r_{ui} >0\\0 &\text{otherwise}\end{cases}$
* $c_{ui} := 1 + \alpha r_{ui}$ where $\alpha \in [0, \infty[$ is a hyper parameter; $\alpha =40$ seems to be reasonable

In contrast, the standard matrix factoriation optimization function looks like this ([example](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf)):

$$\min_{x_*, y_*} \sum_{(u, i, r_{ui}) \in \mathcal{R}} {(r_{ui} - x_u^T y_i)}^2  + \underbrace{\lambda \left ( \sum_u || x_u ||^2 + \sum_i || y_i ||^2\right )}_{\text{Regularization}}$$

where

* $\mathcal{R}$ is the set of all ratings $(u, i, r_{ui})$ - user $u$ has rated item $i$ with value $r_{ui} \in \mathbb{R}$

They use alternating least squares (ALS) to train this model.

The prediction then is the dot product between the user factor and all item factors ([source](https://github.com/benfred/implicit/blob/master/implicit/recommender_base.pyx#L157-L176))

doi.org
sci-hub
scholar.google.com

Learning to rank using gradient descent
Burges, Christopher J. C. and Shaked, Tal and Renshaw, Erin and Lazier, Ari and Deeds, Matt and Hamilton, Nicole and Hullender, Gregory N.
International Conference on Machine Learning - 2005 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 5 years ago

[Learning to rank using gradient descent](https://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf) is a paper published in 2005 by Burges et all from Microsoft. The paper introduced RankNet.

RankNet is a neural network for recommendations.

The main use-case of the paper is ranking search results.

## Key Ideas

* Preprocessing: Filter results which are relevant
* Ranking: Rank results which are relevant by RankNet

## See also

* [Adapting deep RankNet for personalized search](https://www.shortscience.org/paper?bibtexKey=conf/wsdm/SongWH14)

doi.org
sci-hub
scholar.google.com

Adapting deep RankNet for personalized search
Song, Yang and Wang, Hongning and He, Xiaodong
ACM WSDM - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 5 years ago

[Adapting Deep RankNet for Personalized Search](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/wsdm233-song.pdf) is a paper published in 2014 by Song, Wang and He from Microsoft Research. It is heavily beased on [Learning to rank using gradient descent](https://www.shortscience.org/paper?bibtexKey=conf/icml/BurgesSRLDHH05) (Burges et al from Microsoft, 2005).

They use a neural network with 5 hidden layers. They investigate regularization by trunkated gradient and limiting the depth of the back propagation.


## See also

* July 2015: [RankNet: A ranking retrospective](https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/)

dx.doi.org
sci-hub
scholar.google.com

Inferring Road Maps from Global Positioning System Traces
James Biagioni and Jakob Eriksson
Transportation Research Record: Journal of the Transportation Research Board - 2012 via Local CrossRef
Keywords:

[link] Summary by Martin Thoma 6 years ago

[https://www.cs.uic.edu/~jakob/papers/biagioni-trr12.pdf](https://www.cs.uic.edu/~jakob/papers/biagioni-trr12.pdf) is a super nice survey where I almost feel bad to summarize it. Almost.

It is about road map inference in the presence of traces of geo coordinates. So you have many vehicles which log their positions while driving. From this data, you want to infer the latest map.

Problems:
* The map changes: What was a valid map a week ago, might not be anymore (due to construction work, breaking roads, new roads)
* Sensor errors: GPS is not very accurate

## Contributions

* overview over the literature up to 2012 on map generation
* method for the automatic evaluation of generated maps
* an evaluation of three reference algorithms including code
* a [118-h trace data set](https://www.cs.uic.edu/bin/view/Bits/Software) and ground truth map

## How Map Inference Works

* **Preprocess Geo-Traces**: check for unreasonable speed, too extreme acceleration, too abrupt changes
* Inference:
* k-Means: Reduce candidates to centroid. Works on 3D - 2 coordinates and direction
* trace merging: Merge edges directly, without reduction
* kernel density estimation

## Quantitative Evaluation

1. Start at one location in both, the ground truth map and the generated map
2. Follow streets from both and sample points in a fixed range (e.g. each 50cm). Only go in directions that go away from the start
3. Try to match both, ground truth and generated map points. Only match if points are close enough (e.g. a distance of 10cm if this is an acceptable uncertainty). Then use a classification score (e.g. F1)

dx.doi.org
sci-hub
scholar.google.com

Deriving HD maps for highly automated driving from vehicular probe data
Massow, Kay and Kwella, B. and Pfeifer, N. and Hausler, Florian and Pontow, Jens and Radusch, Ilja and Hipp, Jochen and Dölitzscher, Frank and Haueis, Martin
IEEE ITSC - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 6 years ago

A high definition (HD) map in the context of highly automated vehicles is a digital map which contains precise information (error < 1m) necessary for localization and for driving behavior:

* **road geometry**: Where is the road, where are lanes?
* **[street furniture](https://en.wikipedia.org/wiki/Street_furniture)**: street signs, traffic lights
* **dynamic data**: end of traffic jam, construction work

This kind of information can be stored in the [OpenDRIVE format](http://www.opendrive.org/docs/OpenDRIVEFormatSpecRev1.1D.pdf), which is an XML format with the extension `.xodr`.

Creating and maintaining those maps is costly with dedicated cars, so we would like to have an alternative which works automatically. This paper proposes one. **The best summary of the paper is Figure 3.**

## Possible Sensors

The following kind of sensors can be equipped in many cars:

* stereo cameras
* radar
* GPS

## Active Players

* **Continental** Road Database: Automatic Recording and Processing of Highly Accurate Route Data
* [HERE](https://www.here.com/en): map provider which has published a sensor interface specification: “Vehicle Sensor Data Cloud Ingestion Interface
Specification (v2.0.2)

## See also

* [Inferring road maps from global positioning system traces - Survey and comparative evaluation](https://www.cs.uic.edu/~jakob/papers/biagioni-trr12.pdf)
* [Hochgenaue Fahrzeugeigenlokalisierung und kollektives Erlernen hochgenauer digitaler Karten]()

arxiv.org
scholar.google.com

Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art
Janai, Joel and Güney, Fatma and Behl, Aseem and Geiger, Andreas
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

## Problems

* **Computer Vision**:
  * Detection: Given a 2D image, where are cars, pedestrians, traffic signs?
  * Depth estimation: Given a 2D image, estimate the depth
* **Planning**: Where do I want to go?
* **Control**: How should I steer?

## Datasets

* KITTI: Street segmentation (Computer Vision)
* ISPRS
* MOT
* Cityscapes


## What I missed

* GTSRB: The German Traffic Sign Recognition Benchmark dataset
* GTSDB: The German Traffic Sign Detection Benchmark

dx.doi.org
sci-hub
scholar.google.com

Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

## See also
Related papers:
* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)

Blog posts:
* Dhruv Parthasarathy: [A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN](https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4)

arxiv.org
scholar.google.com

Mask R-CNN
He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

## See also
* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)

dx.doi.org
sci-hub
scholar.google.com

What Does Classifying More Than 10,000 Image Categories Tell Us?
Jia Deng and Alexander C. Berg and Kai Li and Li Fei-Fei
Lecture Notes in Computer Science - 2010 via Local CrossRef
Keywords:

[link] Summary by Martin Thoma 7 years ago

In this paper the authors experiment with 10,000 image classes based on ImageNet. As ImageNet is based on Wordnet, they have a semantic tree of the categories.

It should be noted that this paper is from 2010. Hence before AlexNet. They don't use CNNs in this paper.

## Key findings

* A relationship between visual similarity and semantic similarity exists
* Classification can be improved by exploiting semantic hierarchy
* Computational difficulties with 10,000 classes
* More classes -> lower mean accuracy

## See also

* [What makes ImageNet good for transfer learning?](https://arxiv.org/abs/1608.08614) ([slides](https://www.dropbox.com/s/vfmncjnyh57glkc/NIPS_LSCVS_ImageNet%20Analysis.pdf?dl=0))

arxiv.org
arxiv-vanity.com
scholar.google.com

Early Stopping without a Validation Set
Maren Mahsereci and Lukas Balles and Christoph Lassner and Philipp Hennig
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, stat.ML
more

[link] Summary by Martin Thoma 7 years ago

Summary from [reddit](https://www.reddit.com/r/MachineLearning/comments/623oq4/r_early_stopping_without_a_validation_set/dfjzwqq/):

We want to minimize the expected risk (loss) but that's a mean over the real distribution of the data, which we don't know. We approximate that by using a finite dataset and try to minimize the empirical risk instead.
The gradients for the empirical risk are an approximation to the gradients for the expected risk.
The idea is that the real gradients contain just information whereas the approximated gradients contain information + noise. The noise results from using a finite dataset to approximate the real distribution of the data.
By computing local statistics about the gradients, the authors are able to determine when the gradients have no information about the expected risk anymore and what's left is just noise. If we keep optimizing we're going to overfit.

dl.acm.org
sci-hub
scholar.google.com

Statistical Comparisons of Classifiers over Multiple Data Sets
Dem\v{s}ar, Janez
JMLR.org J. Mach. Learn. Res. - 2006 via Local Bibsonomy
Keywords: significance, testing, prediction, classification

[link] Summary by Martin Thoma 7 years ago

Describes how to compare classifiers when they were evaluated on multiple datasets (e.g. CIFAR 10, MNIST and SVHN). Recommends Wilcoxon signed ranks test and Friedman test with the corresponding post-hoc tests. Introduce CD (critical difference) diagrams.

* McNemar test and 5x2cv are good when comparing two classifiers on one dataset
* Describes the Wilcoxon Signed-Ranks Test in section 3.1.3 in detail

dx.doi.org
sci-hub
scholar.google.com

Approximate Statistical Test For Comparing Supervised Classification Learning Algorithms
Dietterich, Thomas G.
Neural Computation Journal - 1998 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

This paper describes some statistical test very neatly. It also has a nice Figure which classifies statistical questions in machine learning.

* 5x2cv: 5 iterations of 2 fold cross validation
* Quasi-F test
* McNemar's test: Described in detail!

## See also

* [On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach](http://www.shortscience.org/paper?bibtexKey=journals/datamine/Salzberg97)

dx.doi.org
sci-hub
scholar.google.com

On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach
Salzberg, Steven
Data Min. Knowl. Discov. - 1997 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

This paper describes common pitfalls when classifiers are compared and recommends McNemars test

## Notes

* t-test is simply the wrong test for such an experimental design


## See also

* Prechelt "A quantitative study of experimental evaluations of neural network algorithms" - most of 200 evaluated paper had flaws
* Wolpert "On the connection between in-sample testing and generalization error" - No classifier is always better than another one.
* Diettrich: [Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms](http://www.shortscience.org/paper?bibtexKey=journals/neco/Dietterich98)
* Demsar: [Statistical Comparisons of Classifiers over Multiple Data Sets](http://www.shortscience.org/paper?bibtexKey=demvsar2006statistical)

arxiv.org
scholar.google.com

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Szegedy, Christian and Ioffe, Sergey and Vanhoucke, Vincent
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

This paper describes the CNN architecture Inception-v4.

They basically update Inception-v3 to use residual connections (see [He et al](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15)). They also simplified the architecture as they moved from DistBelief to [TensorFlow](https://www.tensorflow.org/).

## Previous papers

* Inception-v1: [Going deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)

arxiv.org
arxiv-vanity.com
scholar.google.com

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
Pierre Sermanet and David Eigen and Xiang Zhang and Michael Mathieu and Rob Fergus and Yann LeCun
arXiv e-Print archive - 2013 via Local arXiv
Keywords: cs.CV
more

[link] Summary by Martin Thoma 7 years ago


## Terms

* Classification: Assign one label to one image
* Localization: Many object, there might be multiple instances of the same class.
* Detection: One big dominant object which has to be found and to be classified. Pretty much first classification and then finding a bounding box for the class within the image

## Evaluation

* ILSVRC2013, localization task
* ILSVRC2013, detection task
* ILSVRC2013, classification task

## Implementations

* https://github.com/sermanet/OverFeat

arxiv.org
arxiv-vanity.com
scholar.google.com

Convolutional Neural Fabrics
Shreyas Saxena and Jakob Verbeek
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV, cs.LG, cs.NE
more

[link] Summary by Martin Thoma 7 years ago

Convolutional Neural Fabrics (CNFs) are a construction algorithm for CNN architectures.

> Instead of aiming to select a single optimal architecture, we propose a “fabric” that embeds an exponentially large number of architectures. The fabric consists of a 3D trellis that connects response maps at different layers, scales, and channels with a sparse homogeneous local connectivity pattern.

![Image](http://i.imgur.com/wlISXgo.png)

* **Pooling**: CNFs don't use pooling. However, this might not be necessary as they use strided convolution.
* **Filter size**: All convolutions use kernel size 3.
* **Output layer**: Scale $1 \times 1$, channels = nr of classes
* **Activation function**: Rectified linear units (ReLUs) are used at all nodes.

## Evaluation

* Part Labels dataset (face images from the LFW dataset): a super-pixel accuracy of 95.6%
* MNIST: 0.33% error (see [SotA](https://martin-thoma.com/sota/#image-classification); 0.21 %)
* CIFAR10: 7.43% error (see [SotA](https://martin-thoma.com/sota/#image-classification); 2.72 %)


## What I didn't understand

* "Activations are thus a linear function over multi-dimensional neighborhoods, i.e. a four dimensional
3×3×3×3 neighborhood when processing 2D images"
* "within the first layer, channel c at scale s receives input from channels c + {−1, 0, 1} from scale s − 1": Why does the scale change? Why doesn't the  first layer receive input from the same scale?

arxiv.org
arxiv-vanity.com
scholar.google.com

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer and Azalia Mirhoseini and Krzysztof Maziarz and Andy Davis and Quoc Le and Geoffrey Hinton and Jeff Dean
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.CL, cs.NE, stat.ML
more

[link] Summary by Martin Thoma 7 years ago

A NLP paper.

> "conditional computation, achieving greater than 1000x improvements in model capacity with
only minor losses in computational efficiency on modern GPU clusters. We introduce
a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to
thousands of feed-forward sub-networks"

## Evaluation
* 1 billion word language modeling benchmark
* 100 billion word google news corpus

arxiv.org
arxiv-vanity.com
scholar.google.com

Designing Neural Network Architectures using Reinforcement Learning
Bowen Baker and Otkrist Gupta and Nikhil Naik and Ramesh Raskar
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Martin Thoma 7 years ago

## Ideas

* Find CNN topology with Q-learning and $\varepsilon$-greedy exploration and experience replay


## Evaluation

The authors seem not to know DenseNets

* CIFAR-10: 6.92 % accuracy ([SOTA](https://martin-thoma.com/sota/#image-classification) is 3.46 % - not mentioned in the paper)
* SVHN: 2.06 % accuracy ([SOTA](https://martin-thoma.com/sota/#image-classification) is 1.59% - not mentioned in the paper)
* MNIST: 0.31 % ([SOTA](https://martin-thoma.com/sota/#image-classification) is 0.21 % - not mentioned in the paper)
* CIFAR-100: 27.14 % accuracy ([SOTA](https://martin-thoma.com/sota/#image-classification) is 17.18 % - not mentioned in the paper)


## Related Work

* Google: [Neural Architecture Search with Reinforcement Learning](https://arxiv.org/abs/1611.01578) ([summary](http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.01578#martinthoma))

arxiv.org
arxiv-vanity.com
scholar.google.com

Universal representations:The missing link between faces, text, planktons, and cat breeds
Hakan Bilen and Andrea Vedaldi
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV, stat.ML
more

[link] Summary by Martin Thoma 7 years ago

This paper is about transfer learning for computer vision tasks.

## Contributions
* Before this paper, people focused on similar datasets (e.g. ImageNet-like images) or even the same dataset but a different task (classification -> segmentation). This paper, they look at extremely different dataset (ImageNet-like vs text) but only one task (classification). They show that all layers can be shared (including the last classification layer) between datasets such as MNIST and CIFAR-10
* Normalizing information is necessary for sharing models between datasets in order to compensate for dataset-specific differences. Domain-specific scaling parameters work well.

## Evaluation

* Used datasets:
  1. MNIST (10 classes: handwritten digits 0-9),
  2. SVHN (10 classes: house number digits, 0-9),
  3. [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) (10 classes: airplane, automobile, bird, ...)
  4. Daimler Mono Pedestrian Classification Benchmark (18 × 36 pixels)
  5. Human Sketch dataset (20000 human sketches of every day objects such as “book”, “car”, “house”, “sun”)
  6. German Traffic Sign Recognition (GTSR) Benchmark (43 traffic signs)
  7. Plankton imagery data (classification benchmark that contains 30336 images of various organisms ranging from the smallest single-celled protists to copepods, larval fish, and larger jellies)
  8. Animals with Attributes (AwA): 30475 images of 50 animal species (for zero-shot learning)
  9. Caltech-256: object classification benchmark (256 object categories and an additional background class)
  10. Omniglot: 1623 different handwritten characters from 50 different alphabets (one shot learning)
* images are resized to 64 × 64 pixels, greyscale ones are converted into RGB by setting the three channels to the same value
* Each dataset is also whitened, by subtracting its mean and dividing it by its standard deviation per channel
* **Architecture**: ResNet + Global Average Pooling + FC with Softmax
* "As the majority of the datasets have a different number of classes, we use a dataset-specific fully connected layer in our experiments unless otherwise stated."
* **Data augmentation**: We follow the same data augmentation strategy in [[18]](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15), the 64 × 64 size whitened image is padded with 8 pixels on all sides and a 64×64 patch randomly sampled from the padded image or its horizontal flip (except for MNIST / Omniglot / SVHN, as those contain text)
* **Training**: stochastic gradient descent with momentum

Sharing strategies:

1. Baseline: Train networks for each dataset independantly
2. Full sharing: For MNIST / SVHN / CIFAR-10, group classes randomly together so that Node 2 might be digit "7" for MNIST, digit "3" for SVHN and "aeroplane" for CIFAR-10. They are trained together in one network.
3. Deep sharing: Share all layers except the last one. Use all 10 datasets for this.
4. Partial sharing: Have a dataset-specific first part to compensate for different image statistics, but share the middle of the network.

The results seem to be inconclusive to me.


## Follow-up / related work

arxiv.org
arxiv-vanity.com
scholar.google.com

The Mythos of Model Interpretability
Zachary C. Lipton
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.AI, cs.CV, cs.NE, stat.ML
more

[link] Summary by Martin Thoma 7 years ago

This paper

1. Explains why we want "interpretability" and hence what it can mean, depending on what we want.
2. Properties of interpretable models
3. Gives examples

It is easy to read. A must-read for everybody who wants to know about interpretability of models!


## Why we want interpretability

* Trust
  * Intelligibility: Confidence in models accuracy vs
  * Transparency: Understanding the model

## How to achieve interpretability

* Post-hoc explanations (saliency maps)
* t-SNE

arxiv.org
arxiv-vanity.com
scholar.google.com

DelugeNets: Deep Networks with Massive and Flexible Cross-layer Information Inflows
Jason Kuen and Xiangfei Kong and Gang Wang
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV, cs.LG, cs.NE
more

[link] Summary by Martin Thoma 7 years ago

It's not clear to me where the difference between DenseNets and DelungeNets are.

## Evaluation

* Cifar-10: 3.76% error (DenseNet: )
* Cifar-100: 19.02% error

## See also

* [reddit](https://www.reddit.com/r/MachineLearning/comments/5l0k6w/r_delugenets_deep_networks_with_massive_and/)
* [DenseNet](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

arxiv.org
arxiv-vanity.com
scholar.google.com

Neural Architecture Search with Reinforcement Learning
Barret Zoph and Quoc V. Le
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.AI, cs.NE
more

[link] Summary by Martin Thoma 7 years ago

Find a topology by reinforcement learning. They use REINFORCE from [Williams 1992](http://www.shortscience.org/paper?bibtexKey=Williams:92).


## Ideas

* Structure and connectivity of a Neural Network can be represented by a variable-length string.
* The RNN controller in Neural Architecture Search is auto-regressive, which means it predicts hyperparameters one a time, conditioned on previous predictions
* policy gradient method to maximize the expected accuracy of the sampled architectures
* In our experiments, the process of generating an architecture stops if the number of layers exceeds
a certain value.

## Evaluation

* Computer Vision - **CIFAR-10**: 3.65% error (State of the art are Dense-Nets with 3.46% error)
* Language - **Penn Treebank**: a test set perplexity of 62.4 (3.6 perplexity better than the previous state-of-the-art)

They had a Control Experiment "Comparison against Random Search" in which they showed that they are much better than a random exploration of the data. However, the paper lacks details how exactly the random search was implemented.

## Related Work

* [Designing Neural Network Architectures using Reinforcement Learning](https://arxiv.org/pdf/1611.02167.pdf) ([summary](http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.02167))

dx.doi.org
sci-hub
scholar.google.com

Evolving Neural Network through Augmenting Topologies
Stanley, Kenneth O. and Miikkulainen, Risto
Evolutionary Computation - 2002 via Local Bibsonomy
Keywords: genetic, neat, networks, hyperneat

[link] Summary by Martin Thoma 7 years ago

This paper introduces NEAT (NeuroEvolution of Augmenting Topologies), a genetic algorithm for finding and training neural network topologies. NEAT trains the structure and the weights.

> The weight space is explored through the crossover of network weight vectors and through the mutation of single networks’ weights.


## Evaluation

* XOR-Problem
* Reinforcement learning: balance two poles attached to a cart by moving the cart in appropriate directions to keep the pole from falling

## Topology Encoding

* Direct Encodings
  * Bit string: A bit string encodes the connection matrix / matrices. First used in Structured Genetic Algorithm (sGA). Limitations: Crossover might not be useful; fixed size of layers.
  * Graph encoding: First used in Parallel Distributed Genetic Programming (PDGP).
* Indirect Encodings
  * Cellular Encoding (CE): genomes are programs written in a specialized graph transformation language

## Glossary

For people (like me) who are new to genetic algorithms (GAs):

* [Neuroevolution](https://en.wikipedia.org/wiki/Neuroevolution) (NE): a form of machine learning that uses evolutionary algorithms to train artificial neural networks
* [crossover](https://en.wikipedia.org/wiki/Crossover_(genetic_algorithm)): a genetic operator used to vary the programming of a chromosome or chromosomes from one generation to the next
* [Speciation](https://en.wikipedia.org/wiki/Speciation): the evolutionary process by which biological populations evolve to become distinct species
* TWEANNs: Topology and Weight Evolving Artificial Neural Networks

## Realted

* 2009, K. O. Stanley , D. B. D’Ambrosio and J. Gauci: [A Hypercube-Based Indirect Encoding for Evolving Large-Scale Neural Networks.](http://www.shortscience.org/paper?bibtexKey=journals/alife/StanleyDG09): Introduces HyperNEAT

papers.nips.cc
scholar.google.com

Doubly Convolutional Neural Networks
Zhai, Shuangfei and Cheng, Yu and Zhang, Zhongfei (Mark) and Lu, Weining
Neural Information Processing Systems Conference - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

The paper ([arxiv](https://arxiv.org/abs/1610.09716)) introduces DCNNs (Doubly Convolutional Neural Networks). Those are CNNs which contain a new layer type which generalized convolutional layers.

## Ideas

CNNs seem to learn many filters which are similar to other learned filters in the same layer. The weights are only slightly shifted.

The idea of double convolution is to learn groups filters where filters within each group are translated versions of each other. To achieve this, a doubly convolutional layer allocates a set of meta filters which has filter sizes that are larger than the effective filter size. Effective filters can be then extracted from each meta filter, which corresponds to convolving the meta filters with an identity kernel. All the extracted filters are then concatenated, and convolved with the input.

> We have also confirmed that replacing a convolutional layer with a doubly convolutional layer consistently improves the performance, regardless of the depth of the layer.

## Evaluation

* CIFAR-10+: 7.24% error
* CIFAR-100+: 26.53% error
* ImageNet: 8.23% Top-5 error


## Critique

The k-translation correlation is effectively a [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). I think the authors should have mentioned that.


## Related

TODO

arxiv.org
arxiv-vanity.com
scholar.google.com

Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images
Aravindh Mahendran and Andrea Vedaldi
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, 68T45
more

[link] Summary by Martin Thoma 7 years ago

This paper is about finding naturally looking images for the analysis of machine learning models in computer vision. There are 3 techniques:

* **inversion**: the aim is to reconstruct an image from its representation
* **activation maximization**: search for patterns that maximally stimulate a representation component (deep dream). This does NOT use an initial natural image.
* **caricaturization**: exaggerate the visual patterns that a representation detects in an image

The introduction is nice.


## Code

The paper comes with code: [robots.ox.ac.uk/~vgg/research/invrep](http://www.robots.ox.ac.uk/~vgg/research/invrep/index.html) ([GitHub: aravindhm/deep-goggle](https://github.com/aravindhm/deep-goggle))


## Related

* 2013, Zeiler & Fergus: [Visualizing and Understanding Convolutional Networks ](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma)

arxiv.org
scholar.google.com

Visualizing and Comparing Convolutional Neural Networks
Yu, Wei and Yang, Kuiyuan and Bai, Yalong and Yao, Hongxun and Rui, Yong
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

This paper is about the analysis of CNNs. It seems to be extremely similar to what Zeiler & Fergus did. I can't see the contribution. Only cited 7 times, although it is from December 2014 -> I suggest to read the Zeiler & Fergus paper instead.

## Related

* 2013, Zeiler & Fergus: [Visualizing and Understanding Convolutional Networks ](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma)


## Errors

* " Section ?? provides"

arxiv.org
arxiv-vanity.com
scholar.google.com

Understanding deep learning requires rethinking generalization
Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Martin Thoma 7 years ago

This paper deals with the question what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained.

When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, but not "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)

arxiv.org
scholar.google.com

Big Neural Networks Waste Capacity
Dauphin, Yann and Bengio, Yoshua
arXiv e-Print archive - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

The paper 'Big Neural Networks Waste Capacity' recognizes that adding more layer / parameters does not improve accuracy. When reading this paper, one should bear in mind that it was written well before [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15) or DenseNets.

In the experiments, they applied MLPs to SIFT features of ImageNet LSVRC-2010.

**Do not read this paper**. Instead, you might want to read the "Deep Residual Learning for Image Recognition". It makes the same point, but clearer and offers a solution to the underfitting problem.


## Criticism

I don't understand why they write about k-means.

> Assuming minimal error in the human labelling of the dataset, it should be possible to reach errors close to 0%.

For ImageNet, the human labeling error is estimated at about 5% (I can't find the source for that, though)


> Improvements on ImageNet are thought to be a good proxy for progress in object recognition (Deng et al., 2009).

ImageNet images are very different from "typical web images" like the [100 million images Flickr dataset](http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images-for).

nips.djvuzone.org
sci-hub
scholar.google.com

The Cascade-Correlation Learning Architecture
Fahlman, Scott E. and Lebiere, Christian
Neural Information Processing Systems Conference - 1989 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

Cascade Correlation is an algorithm to create feed-forward neural network architectures. However, those architectures are not the typical layered architectures. See [my YouTube video](https://www.youtube.com/watch?v=1E3XZr-bzZ4) for a short explanation of the constructed architecture.

For the "correlation" part, see [this question](http://datascience.stackexchange.com/q/9672/8820).

## Related work

See [Meiosis Networks summary](http://www.shortscience.org/paper?bibtexKey=conf/nips/Hanson89#martinthoma) for many topology learning papers

scholar.google.com

Dynamic Node Creation in Backpropagation Networks
Ash, Timur
Connection Science - 1989 via Local Bibsonomy
Keywords: nn

[link] Summary by Martin Thoma 7 years ago

Dynamic Node Creation (DNC) is about topology learning. DNC sequentially adds single nodes to the network until the desired accuracy is achieved.

DNC uses the logistic activation function and creates layered feed-forward architectures with only one hidden layer. So basically they only added one neuron at a time to the existing hidden layer.

They expected this to be sufficient, as it was shown that networks with only one hidden layer can model "any function of interest to an arbitrary selected precision". However, the number of neurons might be really huge and much larger than if one used more layers.

## Related Work

See [Meiosis Networks summary](http://www.shortscience.org/paper?bibtexKey=conf/nips/Hanson89#martinthoma) for many topology learning papers

nips.djvuzone.org
sci-hub
scholar.google.com

Meiosis Networks
Hanson, Stephen Jose
Neural Information Processing Systems Conference - 1989 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

This paper is about topology learning (also called *structural learning* as in contrast to *parametric learning*) for neural networks. Instead of taking deterministic weights, each weight $w_{ij}$ is normal distributed ($w_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2_{ij})$). Hence every connection has two learned parameters: $\mu_{ij}$ and $\sigma^2_{ij}$.

Meiosis is cell division. So meiosis networks split nodes under some conditions.

The "topology" being learned seems only to add single neurons to the given layers. It is not able to add new layers or add skip connections.

## Chapters

* 1.1 Learning and Search: The author seems to describe the [VC dimension](https://en.wikipedia.org/wiki/VC_dimension).
* 1.2 Stochastic Delta Rule: Explains how to update the weights parameters.
* 1.3 Meiosis: Networks variance is initialized randomly with $\sigma_i^2 \sim U([-10, 10])$ (negative variance???). A node $j$ is splitted, when the random part dominates the value of the sampled weights: $$\frac{\sum_i \sigma_{ij}}{\sum_i \mu_{ij}} > 1 \text{ and } \frac{\sum_k \sigma_{jk}}{\sum_k \mu_{jk}} > 1$$ The mean of the new nodes is sampled around the old mean (TODO: how is it sampled?), half the variance is assigned to the new connections.
* 1.4 Examples: XOR, 3-bit parity, blood NMR data, learning curves. Learning rate of $\eta = 0.5$, momentum of $\alpha = 0.75$.
* 1.5 Conclusion:

## What I don't understand

1. In the present approach, weights reflect a coarse prediction history as coded by a distribution of values and parameterized in the mean and standard deviation of these weight distributions. 
2. The first formula (1)
3. What negative variance is.
4. How exactly the means are sampled


## Related work

* Constructive Methods
  * 1989: [The Cascade-Correlation Learning Architecture](http://www.shortscience.org/paper?bibtexKey=conf/nips/FahlmanL89)
  * 1989: [Dynamic Node Creation in Backpropagation Networks](http://www.shortscience.org/paper?bibtexKey=ash:dynamic): Only one hidden layer
* Pruning methods
  * 1989: [Optimal Brain Damage](http://www.shortscience.org/paper?bibtexKey=conf/nips/CunDS89)
  * 1993: [Optimal Brain Surgeon](http://www.shortscience.org/paper?bibtexKey=conf/nips/HassibiS92)
  * 2015: [Learning both Weights and Connections for Efficient Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1506.02626)
  * 2016: [Neural networks with differentiable structure](http://www.shortscience.org/paper?bibtexKey=journals%2Fcorr%2F1606.06216#martinthoma)

arxiv.org
arxiv-vanity.com
scholar.google.com

Language Modeling with Gated Convolutional Networks
Yann N. Dauphin and Angela Fan and Michael Auli and David Grangier
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Martin Thoma 7 years ago

This paper is about a new model for language which uses a convolutional approach instead of LSTMs.


## General Language modeling

Statistical language models estimate the probability distribution of a sequence of words. They are important for ASR (automatic speech recognition) and translation. The usual approach is to embedd words into $\mathbb{R}^n$ and then apply RNNs to the vector sequences.


## Evaluation

* [WikiText-103](http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/): [Perplexity](https://en.wikipedia.org/wiki/Perplexity) of 44.9 (lower is better)
* new best single-GPU result on the Google Billion Word benchmark: Perplexity of 43.9


## Idea

* uses Gated Linear Units (GLU)
* uses pre-activation residual blocks
* adaptive softmax
* no tanh in the gating mechanism
* use gradient clipping

## See also

* [Reddit](https://www.reddit.com/r/MachineLearning/comments/5kbsjb/r_161208083_language_modeling_with_gated/)
* [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/abs/1612.04426): Test perplexity of **40.8 on WikiText-103**

arxiv.org
arxiv-vanity.com
scholar.google.com

Learning both Weights and Connections for Efficient Neural Networks
Song Han and Jeff Pool and John Tran and William J. Dally
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.NE, cs.CV, cs.LG
more

[link] Summary by Martin Thoma 7 years ago

This paper is about pruning a neural network to reduce the FLOPs and memory necessary to use it. This method reduces AlexNet parameters to 1/9  and VGG-16 to 1/13 of the original size.

## Receipt

1. Train a network
2. Prune network: For each weight $w$: if w < threshold, then w <- 0.
3. Train pruned network

## See also

* [Optimal Brain Damage](http://www.shortscience.org/paper?bibtexKey=conf/nips/CunDS89)

nips.djvuzone.org
sci-hub
scholar.google.com

Second Order Derivatives for Network Pruning: Optimal Brain Surgeon
Hassibi, Babak and Stork, David G.
Neural Information Processing Systems Conference - 1992 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

Optimal Brain Surgeon (OBS) is about pruning neural networks to minimize the amount of parameters, training time and overfitting. It is very similar to [Optimal Brain Damage](http://www.shortscience.org/paper?bibtexKey=conf/nips/CunDS89#martinthoma), but claims to choose better weights. However, it does require to compute the inverse hessian. The hessian matrix of a neural network is a $n \times n$ matrix, where $n$ is the number of parameters of the network. Typically, $n > 10^6$. This makes the approach unusable.

nips.djvuzone.org
sci-hub
scholar.google.com

Optimal Brain Damage
LeCun, Yann and Denker, John S. and Solla, Sara A.
Neural Information Processing Systems Conference - 1989 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 7 years ago

Optimal Brain Damage (OBD) is a techique to make a network smaller by pruning small weights.


## Idea

* use second-derivative information to make tradeoff between network complexity and training error
* do this while training to prevent overfitting / reduce the need for data / reduce training time
* **How to choose what to delete**: Weights which have least impact on training error. This is estimated by approximating the function with a Taylor series.

## Recipe

(Directly copied from the paper):

The OBD procedure can be carried out as follows:

1. Choose a reasonable network architecture
2. Train the network until a reasonable solution is obtained
3. Compute the second derivatives $h_{kk}$ for each parameter
4. Compute the saliencies for each parameter: $s_k = h_{kk} u_k^2 /2$
5. Sort the parameters by saliency and delete some low-saliency parameters
6. Iterate to step 2

Deleting a parameter is defined as setting it to 0 and freezing it there. Several
variants of the procedure can be devised, such as decreasing the values of the low-saliency parameters instead of simply setting them to 0, or allowing the deleted
parameters to adapt again after they have been set to 0.

## See also

* 1989: Optimal Brain Damage ([original pdf](https://papers.nips.cc/paper/250-optimal-brain-damage.pdf), [nice pdf](http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf), [txt](https://github.com/NicolasEstrada/nlp/blob/master/nipstxt/nips02/0598.txt))
* 1993: [Optimal Brain Surgeon](http://www.shortscience.org/paper?bibtexKey=conf/nips/HassibiS92) ([pdf](https://papers.nips.cc/paper/647-second-order-derivatives-for-network-pruning-optimal-brain-surgeon.pdf) and [follow-up](http://www.shortscience.org/paper?bibtexKey=conf/nips/HassibiSW93), [2](http://www.shortscience.org/paper?bibtexKey=conf/epia/EndischHS07))
* 1998: LeNet-5
* 2012: AlexNet
* 2015: [Learning both Weights and Connections for Efficient Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1506.02626)
* 2016: [Neural networks with differentiable structure](http://www.shortscience.org/paper?bibtexKey=journals%2Fcorr%2F1606.06216#martinthoma)

arxiv.org
arxiv-vanity.com
scholar.google.com

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun
arXiv e-Print archive - 2014 via Local arXiv
Keywords: cs.CV
more

[link] Summary by Martin Thoma 7 years ago

Spatial Pyramid Pooling (SPP) is a technique which allows Convolutional Neural Networks (CNNs) to use input images of any size, not only $224\text{px} \times 224\text{px}$ as most architectures do. (However, there is a lower bound for the size of the input image).

## Idea

* Convolutional layers operate on any size, but fully connected layers need fixed-size inputs
* Solution:
  * Add a new SPP layer on top of the last convolutional layer, before the fully connected layer
  * Use an approach similar to bag of words (BoW), but maintain the spatial information. The BoW approach is used for text classification, where the order of the words is discarded and only the number of occurences is kept.
  * The SPP layer operates on each feature map independently.
  * The output of the SPP layer is of dimension $k \cdot M$, where $k$ is the number of feature maps the SPP layer got as input and $M$ is the number of bins.

Example: We could use spatial pyramid pooling with 21 bins:

* 1 bin which is the max of the complete feature map
* 4 bins which divide the image into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region.
* 16 bins which divide the image into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region.

## Evaluation

* Pascal VOC 2007, Caltech101: state-of-the-art, without finetuning
* ImageNet 2012: Boosts accuracy for various CNN architectures
* ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014: Rank #2


## Code

The paper claims that the code is [here](http://research.microsoft.com/en-us/um/people/kahe/), but this seems not to be the case any more.

People have tried to implement it with Tensorflow ([1](http://stackoverflow.com/q/40913794/562769), [2](https://github.com/fchollet/keras/issues/2080), [3](https://github.com/tensorflow/tensorflow/issues/6011)), but by now no public working implementation is available.


## Related papers

* [Atrous Convolution](https://arxiv.org/abs/1606.00915)

1 Comments

arxiv.org
arxiv-vanity.com
scholar.google.com

"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.AI, stat.ML
more

[link] Summary by Martin Thoma 7 years ago

This paper describes how to find local interpretable model-agnostic explanations (LIME) why a black-box model $m_B$ came to a classification decision for one sample $x$. The key idea is to evaluate many more samples around $x$ (local) and fit an interpretable model $m_I$ to it. The way of sampling and the kind of interpretable model depends on the problem domain.

For computer vision / image classification, the image $x$ is divided into superpixels. Single super-pixels are made black, the new image $x'$ is evaluated $p' = m_B(x')$. This is done multiple times. 

The paper is also explained in [this YouTube video](https://www.youtube.com/watch?v=KP7-JtFMLo4) by Marco Tulio Ribeiro.

A very similar idea is already in the [Zeiler & Fergus paper](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma).

## Follow-up Paper

* June 2016: [Model-Agnostic Interpretability of Machine Learning](https://arxiv.org/abs/1606.05386)
* November 2016:
  * [Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance](https://arxiv.org/abs/1611.05817)
  * [An unexpected unity among methods for interpreting
model predictions](https://arxiv.org/abs/1611.07478)

arxiv.org
arxiv-vanity.com
scholar.google.com

Neural networks with differentiable structure
Thomas Miconi
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.NE
more

[link] Summary by Martin Thoma 7 years ago

The paper describes a procedure for topology pruning based on L1 norm to make weights small and a threshold for deleting them alltogether.

It is similar to [Optimal Brain Damage](https://arxiv.org/abs/1606.06216) and [Optimal Brain Surgeon](http://ee.caltech.edu/Babak/pubs/conferences/00298572.pdf).

arxiv.org
arxiv-vanity.com
scholar.google.com

Striving for Simplicity: The All Convolutional Net
Jost Tobias Springenberg and Alexey Dosovitskiy and Thomas Brox and Martin Riedmiller
arXiv e-Print archive - 2014 via Local arXiv
Keywords: cs.LG, cs.CV, cs.NE
more

[link] Summary by Martin Thoma 8 years ago

A paper in the intersection for Computer Vision and Machine Learning. They simplify networks by replacing max-pooling by convolutions with higher stride.

* introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches


## Datasets

competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet)

arxiv.org
arxiv-vanity.com
scholar.google.com

Network In Network
Min Lin and Qiang Chen and Shuicheng Yan
arXiv e-Print archive - 2013 via Local arXiv
Keywords: cs.NE, cs.CV, cs.LG
more

[link] Summary by Martin Thoma 8 years ago

A paper in the intersection for Computer Vision and Machine Learning. They propose a method (network in network) to reduce parameters. Essentially, it boils down to a pattern of (conv with size > 1) -> (1x1 conv) -> (1x1 conv) -> repeat

## Datasets
state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST

## Implementations

* [Lasagne](https://github.com/Lasagne/Recipes/blob/master/modelzoo/cifar10_nin.py)

arxiv.org
arxiv-vanity.com
scholar.google.com

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG
more

[link] Summary by Martin Thoma 8 years ago

One problem of training deep networks is that the features of lower-layer networks change while the upper-layer networks have already been adjusted to the previous lower-layer features. The phenomenon of changing inputs while optimizing is called *internal covariate shift*.

Batch normalization is done at training time for each mini batch.

## Ideas

* Training converges faster, if input is whitened (zero means, unit variances, decorrelated).
* Normalization parameters have to be computed within the gradient calculation step to prevent the model from blowing up

## What Batch Normalization is

For a layer with $d$-dimensional input $x = (x^{(1)}, \dots, x^{(d)})$, we will normalize
each dimension 
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
where the expectation and the variance are computed over the training data set. This does *not* decorrelate the features, though.

Additionally, for each activation $x^{(k)}$ two paramters $\gamma^{(k)}, \beta^{(k)}$ are introduced which scale and shift the feature:

$$y^{(k)} = \gamma^{(k)} \cdot \hat{x}^{(k)} + \beta^{(k)}$$

Those two parameters (per feature) are learnable!

## Effect of Batch normalization

* Higher learning rates can be used
* Initialization is less important
* Acts as a regularizer, eliminating the need for dropout in some cases
* Faster training

## Datasets

* reaching 4.9% top-5 validation error (and 4.8% test error) on ImageNet classification

## Used by

* [Going Deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)
* [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15#martinthoma)

## See also

* [other summaries](http://www.shortscience.org/paper?bibtexKey=conf/icml/IoffeS15)

dl.acm.org
sci-hub
scholar.google.com

Dropout: a simple way to prevent neural networks from overfitting
Srivastava, Nitish and Hinton, Geoffrey E. and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan
Journal of Machine Learning Research - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

This paper is a much better introduction to Dropout than [Improving neural networks by preventing
co-adaptation of feature detectors](http://www.shortscience.org/paper?bibtexKey=journals/corr/1207.0580), written by the same authors two years later.

## General idea of Dropout

Dropout is a layer type. It has a parameter $\alpha \in (0, 1)$. The output dimensionality of a dropout layer is equal to its input dimensionality. With a probability of $\alpha$ any neurons output is set to 0. At testing time, the output of all neurons is multiplied with $\alpha$ to compensate for the fact that no output is set to 0.


## Interpretations

Dropout can be interpreted as training an ensemble of many networks, which share weights.

It can also be seen as a regularizer.

arxiv.org
arxiv-vanity.com
scholar.google.com

An empirical analysis of dropout in piecewise linear networks
David Warde-Farley and Ian J. Goodfellow and Aaron Courville and Yoshua Bengio
arXiv e-Print archive - 2013 via Local arXiv
Keywords: stat.ML, cs.LG, cs.NE
more

1	[link] Summary by Martin Thoma 8 years ago This paper analyses fully connected networks with dropout and ReLU activation functions. more less

arxiv.org
arxiv-vanity.com
scholar.google.com

Towards Dropout Training for Convolutional Neural Networks
Haibing Wu and Xiaodong Gu
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV, cs.NE
more

1	[link] Summary by Martin Thoma 8 years ago Probabilistic weighted pooling is proposed in this paper. It is based on max-pooling and dropout. more less

www.icml2010.org
sci-hub
scholar.google.com

A Theoretical Analysis of Feature Pooling in Visual Recognition
Boureau, Y-Lan and Ponce, Jean and LeCun, Yann
International Conference on Machine Learning - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

This paper analyzes max pooling and average pooling, as it is used in many convolutional neural networks (CNNs).

## Why pooling is used

* invariance to image transformations
* more compact representations (- remove irrelevant information)
* better robustness to noise and clutter


## Max pooling or average pooling?

No clear answer to that. Sometimes one seems to be better, sometimes the other, sometimes something in between.

dx.doi.org
sci-hub
scholar.google.com

Deep Face Recognition
Parkhi, Omkar M. and Vedaldi, Andrea and Zisserman, Andrew
BMVA Press BMVC - 2015 via Local Bibsonomy
Keywords: dblp

1	[link] Summary by Martin Thoma 8 years ago This paper is about data collection for face recognition. One idea was to use weaker classifiers to rank the data presented to the annotators. more less

arxiv.org
scholar.google.com

Visualizing and Understanding Convolutional Networks
Zeiler, Matthew D. and Fergus, Rob
arXiv e-Print archive - 2013 via Local Bibsonomy
Keywords: cnn, deeplearning

[link] Summary by Martin Thoma 8 years ago

The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) occlusion sensitivity analysis.

## Analyzation techniques
### Visualization

A multi-layer deconvolutional network is used to project the feature activations back into pixel space, showing what input pattern originally caused a given activation in the feature maps. The idea is to train a network which is given the result of a layer $L_i$ and has to reconstruct the input feature map of $L_i$. This is repeated until the input image is reached.

The deconv-net has a special **unpooling layer**: The max-pooling layers have to save where an activation came from and store those to a switch variable, which is used in unpooling.

### Occlusion sensitivity analysis

* Occlude(I, x, y): Put a gray square centered at $(x, y)$ over a part of the image $I$. Run the classifier.
* Create an image like this:
    * Run Occlude(I, x, y) for all $(x, y)$ (possible with stride)
    * At $(x, y)$, either ...
        * (d) ... place a pixel which color-encodes the probability of the correct class
        * (e) ... place a pixel which color-encodes the most probable class

The following image from the Zeiler & Fergus paper visualizes this pretty well:

If the dogs face is occluded, the probability of the correct class drops a lot:
![Imgur](http://i.imgur.com/Q1Ama2z.png)

If the dogs face is occluded, the most likely class suddenly is "tennisball" and no longer "Pomeranian".
![Imgur](http://i.imgur.com/5QYKh7b.png)

See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma).


## How visualization helped to construct ZF-Net

* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> Lower filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> Lower stride from 4 to 2
* The occlusion analysis helps to boost confidence that the kind of features being learned are actually correct. 

## ZF-Net
Zeiler and Fergus also created a new network for ImageNet.

The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max pooling layers.

Training setup:

* **Preprocessing**: Resize smallest dimension to 256, per-pixel mean subtraction per channel, crop $224\text{px} \times 224\text{px}$ region
* **Optimization**: Mini-Batch SGD, learning rate $= 10^{-2}$, momentum = $0.9$, 70 epochs
* **Resources**: took around 12 days on a single GTX580 GPU

The network was evaluated on

* ImageNet 2012: 14.8% error
* Caltech-101: $86.5 \pm 0.5$ (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ (pretrained on ImageNet)

## Minor errors

* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)

arxiv.org
arxiv-vanity.com
scholar.google.com

FaceNet: A Unified Embedding for Face Recognition and Clustering
Florian Schroff and Dmitry Kalenichenko and James Philbin
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV
more

[link] Summary by Martin Thoma 8 years ago

FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative).

The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other persons image.

## LMNN

Large Margin Nearest Neighbor (LMNN) is learning a pseudo-metric

$$d(x, y) = (x -y) M  (x -y)^T$$

where $M$ is a positive-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold.

## Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets.

They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low.

They want to have

$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$

where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$.

## Tasks

* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?

## Datasets

* 99.63% accuracy on Labeled FAces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB

## Network

Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13)  and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).

## See also

* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)

arxiv.org
arxiv-vanity.com
scholar.google.com

Improving neural networks by preventing co-adaptation of feature detectors
Geoffrey E. Hinton and Nitish Srivastava and Alex Krizhevsky and Ilya Sutskever and Ruslan R. Salakhutdinov
arXiv e-Print archive - 2012 via Local arXiv
Keywords: cs.NE, cs.CV, cs.LG
more

[link] Summary by Martin Thoma 8 years ago

This paper introduced Dropout, a new layer type. It has a parameter $\alpha \in (0, 1)$. The output dimensionality of a dropout layer is equal to its input dimensionality. With a probability of $\alpha$ any neurons output is set to 0. At testing time, the output of all neurons is multiplied with $\alpha$ to compensate for the fact that no output is set to 0.

A much better paper, by the same authors but 2 years later, is [Dropout: a simple way to prevent neural networks from overfitting](http://www.shortscience.org/paper?bibtexKey=journals/jmlr/SrivastavaHKSS14).

Dropout can be interpreted as training an ensemble of many networks, which share weights.

It was notably used by [ImageNet Classification with Deep Convolutional Neural Networks](http://www.shortscience.org/paper?bibtexKey=krizhevsky2012imagenet).

scholar.google.com

Neocognitron: A self-organizing neural network model for a mechanish of pattern recognition unaffected by shifts in position
Fukushima, K.
Biological Cybernetics - 1980 via Local Bibsonomy
Keywords: deep, neural, learning, fukushima, neocognitron, networks

1	[link] Summary by Martin Thoma 8 years ago This was an inspiration for CNNs (convolutional neural networks). See also: * [Neocognitron](https://en.wikipedia.org/wiki/Neocognitron) more less

arxiv.org
scholar.google.com

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size
Iandola, Forrest N. and Moskewicz, Matthew W. and Ashraf, Khalid and Han, Song and Dally, William J. and Keutzer, Kurt
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

This paper is about the reduction of model parameters while maintaining (most) of the models accuracy.

The paper gives a nice overview over some key findings about CNNs. One part that is especially interesting is "2.4. Neural Network Design Space Exploration".

## Model compression

Key ideas for model compression are:

* singular value decomposition (SVD)
* replace parameters that are below a certain threshold with zeros to form a sparse matrix
* combining Network Pruning with quantization (to 8 bits or less)
* huffman encoding (Deep Compression)

Ideas used by this paper are

* Replacing 3x3 filters by 1x1 filters
* Decrease the number of input channels by using **squeeze layers**

One key idea to maintain high accuracy is to downsample late in the network. This means close to the input layer, the layer parameters have stride = 1, later they have stride > 1.


## Fire module

A Fire module is a squeeze convolution layer (which has only $n_1$ 1x1 filters), feeding into an expand layer that has a mix of $n_2$ 1x1 and $n_3$ 3x3 convolution filters. It is chosen 
$$n_1 < n_2 + n_3$$
(Why?)

(to be continued)

arxiv.org
arxiv-vanity.com
scholar.google.com

Fully Convolutional Networks for Semantic Segmentation
Jonathan Long and Evan Shelhamer and Trevor Darrell
arXiv e-Print archive - 2014 via Local arXiv
Keywords: cs.CV
more

[link] Summary by Martin Thoma 8 years ago

## Terms

* Semantic Segmentation: Traditional segmentation divides the image in visually similar patches. Semantic segmentation on the other hand divides the image in semantically meaningful patches. This usually means to classify each pixel (e.g.: This pixel belongs to a cat, that pixel belongs to a dog, the other pixel is background).


## Main ideas

* Complete neural networks which were trained for image classification can be used as a convolution. Those networks can be trained on Image Net (e.g. VGG, AlexNet, GoogLeNet)
* Use upsampling to (1) reduce training and prediction time (2) improve consistency of output. (See [What are deconvolutional layers?](http://datascience.stackexchange.com/a/12110/8820) for an explanation.)


## How FCNs work

1. Train a neural network for image classification which is trained on input images of a fixed size ($d \times w \times h$)
2. Interpret the network as a single convolutional filter for each output neuron (so $k$ output neurons means you have $k$ filters) over the complete image area on which the original network was trained.
3. Run the network as a CNN over an image of any size (but at least $d \times w \times h$) with a stride $s \in \mathbb{N}_{\geq 1}$
4. If $s > 1$, then you need an upsampling layer (deconvolutional layer) to convert the coarse output into a dense output.

## Nice properties

* FCNs take images of arbitrary size and produce an image of the same output size.
* Computationally efficient

## See also:

https://www.quora.com/What-are-the-benefits-of-converting-a-fully-connected-layer-in-a-deep-neural-network-to-an-equivalent-convolutional-layer

> They allow you to treat the convolutional neural network as one giant filter. You can then spatially apply the neural net as a convolution to images larger than the original training image size, getting a spatially dense output.
>
> Let's say you train a neural net (with some loss function) with a convolutional layer (3 x 3, stride of 2), pooling layer (3 x 3, stride of 2), and a fully connected layer with 10 units, using 25 x 25 images. Note that the receptive field size of each max pooling unit is 7 x 7, so the pooling output is 5 x 5. You can convert the fully connected layer to to a set of  10 5 x 5 convolutional filters (unit strides). If you do that, the entire net can be treated as a filter with receptive field size 35 x 35 and stride of 4. You can then take that net and apply it to a 50 x 50 image, and you'd get a 3 x 3 x 10 spatially dense output.

1 Comments

doi.ieeecomputersociety.org
sci-hub
scholar.google.com

DeepFace: Closing the Gap to Human-Level Performance in Face Verification
Taigman, Yaniv and Yang, Ming and Ranzato, Marc'Aurelio and Wolf, Lior
Conference and Computer Vision and Pattern Recognition - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

## General stuff about face recognition

Face recognition has 4 main tasks:

* **Face detection**: Given an image, draw a rectangle around every face
* **Face alignment**: Transform a face to be in a canonical pose
* **Face representation**: Find a representation of a face which is suitable for follow-up tasks (small size, computationally cheap to compare, invariant to irrelevant changes)
* **Face verification**: Images of two faces are given. Decide if it is the same person or not.

The face verification task is sometimes (more simply) a face classification task (given a face, decide which of a fixed set of people it is).

Datasets being used are:

* **LFW** (Labeled Faces in the Wild): 97.35% accuracy; 13 323 web photos of 5 749 celebrities
* **YTF** (YouTube Faces): 3425 YouTube videos of 1 595 subjects
* **SFC** (Social Face Classification): 4.4 million labeled faces from 4030 people, each 800 to 1200 faces
* **USF** (Human-ID database): 3D scans of faces

## Ideas in this paper

This paper deals with face alignment and face representation.

**Face Alignment**

They made an average face with the USF dataset. Then, for each new face, they apply the following procedure:

* Find 6 points in a face (2 eyes, 1 nose tip, 2 corners of the lip, 1 middle point of the bottom lip)
* Crop according to those
* Find 67 points in the face / apply them to a normalized 3D model of a face
* Transform (=align) face to a normalized position

**Representation**

Train a neural network on 152x152 images of faces to classify 4030 celebrities. Remove the softmax output layer and use the output of the second-last layer as the transformed representation.

The network is:

* C1 (convolution): 32 filters of size $11 \times 11 \times 3$ (RGB-channels) (returns $142\times 142$ "images")
* M2 (max pooling): $3 \times 3$, stride of 2  (returns $71\times 71$ "images")
* C3 (convolution): 16 filters of size $9 \times 9 \times 16$ (returns $63\times 63$ "images")
* L4 (locally connected): $16\times9\times9\times16$ (returns $55\times 55$ "images")
* L5 (locally connected): $16\times7\times7\times16$ (returns $25\times 25$ "images")
* L6 (locally connected): $16\times5\times5\times16$ (returns $21\times 21$ "images")
* F7 (fully connected): ReLU, 4096 units
* F8 (fully connected): softmax layer with 4030 output neurons

The training was done with:

* Stochastic Gradient Descent (SGD)
* Momentum of 0.9
* Performance scheduling (LR starting at 0.01, ending at 0.0001)
* Weight initialization: $w \sim \mathcal{N}(\mu=0, \sigma=0.01)$, $b = 0.5$
* ~15 epochs ($\approx$ 3 days) of training


## Evaluation results

* **Quality**:
  * 97.35% accuracy (or mean accuracy?) with an Ensemble of DNNs for LFW
  * 91.4% accuracy with a single network on YTF
* **Speed**: DeepFace runs in 0.33 seconds per image (I'm not sure which size). This includes image decoding, face detection and alignment, **the** feed forward network (why only one? wasn't this the best performing Ensemble?) and final classification output

## See also

* Andrew Ng: [C4W4L03 Siamese Network](https://www.youtube.com/watch?v=6jfw8MuKwpI)

www.jmlr.org
scholar.google.com

Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier and Bengio, Yoshua
Journal of Machine Learning Research - 2010 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. is a **normalized weight initialization**

$$W \sim U \left [  - \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$

where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$.

Showing some ways **how to debug neural networks** might be another reason to read the paper.

The paper analyzed standard multilayer perceptrons (MLPs) on a artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign).

However, no regularization was used and many mini-batch epochs were learned. It might be that batch normalization / dropout might change the influence of initialization very much.

Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that is has the plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?

1 Comments

doi.acm.org
sci-hub
scholar.google.com

MetaCost: A General Method for Making Classifiers Cost-Sensitive
Domingos, Pedro M.
ACM KDD - 1999 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

MetaCost is a meta-algorithm which makes error-based classifiers making their decision based on the cost of errors. For example, sending advertisement is cheap, so it might be worth a lot of false positives to get a single person who is actually interested in the advertisement.

The algorithm is given in pseudocode in the paper.

Important notation:

* $C(i, j)$: Cost of predicting an example belongs to class $i$, where in fact it belongs to class $j$.

papers.nips.cc
scholar.google.com

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Ren, Shaoqing and He, Kaiming and Girshick, Ross B. and Sun, Jian
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network, replace the last layer with a regression layer which essentially predicts pixel-wise if the object is there or not. An bounding box inference algorithm is added at last to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf)).

The paper introduces RPNs (Region Proposal Networks). They are end-to-end trained to generate region proposals.They simoultaneously regress region bounds and bjectness scores at each location on a regular grid.

RPNs are one type of fully convolutional networks. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score.

## See also
* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)

dx.doi.org
sci-hub
scholar.google.com

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Girshick, Ross B. and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra
Conference and Computer Vision and Pattern Recognition - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

The [R-CNN](http://arxiv.org/abs/1311.2524) paper presents a method based on convolutional neural networks (CNNs) for object detection. It does so by region proposals (hence the "R"). The key insight was to train CNNs on classification tasks and use the learned features for the region proposals. The do *not* use a sliding window approach such as Overfeat. They create around 2000 category-independent region proposals. For each proposal, they crop the part of that image. Then they resize the cropped part to fit into the CNN and classify it.


Notable follow-ups are:

* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15)
* [Faster R-CNNs](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15)

papers.nips.cc
scholar.google.com

Deep Neural Networks for Object Detection
Szegedy, Christian and Toshev, Alexander and Erhan, Dumitru
Neural Information Processing Systems Conference - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network, replace the last layer with a regression layer which essentially predicts pixel-wise if the object is there or not. An bounding box inference algorithm is added at last to make a consistent prediction.

scholar.google.com

Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E
Neural Information Processing Systems Conference - 2012 via Local Bibsonomy
Keywords: image, imagenet, thema:deepwalk, classification

[link] Summary by Martin Thoma 8 years ago

This paper is about Convolutional Neural Networks for Computer Vision. It was the first break-through in the ImageNet classification challenge (LSVRC-2010, 1000 classes).

ReLU was a key aspect which was not so often used before. The paper also used Dropout in the last two layers.

## Training details

* Momentum of 0.9
* Learning rate of $\varepsilon$ (initialized at 0.01)
* Weight decay of $0.0005 \cdot \varepsilon$.
* Batch size of 128
* The training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs.

## See also

* [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)

arxiv.org
scholar.google.com

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 8 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. 

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAp@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAp@0.5
* PASCAL VOC 2012: 83.8% mAp@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

arxiv.org
scholar.google.com

Deep Networks with Stochastic Depth
Huang, Gao and Sun, Yu and Liu, Zhuang and Sedra, Daniel and Weinberger, Kilian
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: deeplearning, acreuser

[link] Summary by Martin Thoma 8 years ago

**Dropout for layers** sums it up pretty well. The authors built on the idea of [deep residual networks](http://arxiv.org/abs/1512.03385) to use identity functions to skip layers. 

The main advantages:

* Training speed-ups by about 25%
* Huge networks without overfitting

## Evaluation

* [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html): 4.91% error ([SotA](https://martin-thoma.com/sota/#image-classification): 2.72 %) Training Time: ~15h
* [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html): 24.58% ([SotA](https://martin-thoma.com/sota/#image-classification): 17.18 %) Training time: < 16h
* [SVHN](http://ufldl.stanford.edu/housenumbers/):  1.75% ([SotA](https://martin-thoma.com/sota/#image-classification): 1.59 %) - trained for 50 epochs, begging with a LR of 0.1, divided by 10 after 30 epochs and 35. Training time: < 26h

Martin Thoma

sciscore: 2.064