Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1547 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

FaceNet: A Unified Embedding for Face Recognition and Clustering

Florian Schroff and Dmitry Kalenichenko and James Philbin

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.CV

**First published:** 2015/03/12 (6 years ago)

**Abstract:** Despite significant recent advances in the field of face recognition,
implementing face verification and recognition efficiently at scale presents
serious challenges to current approaches. In this paper we present a system,
called FaceNet, that directly learns a mapping from face images to a compact
Euclidean space where distances directly correspond to a measure of face
similarity. Once this space has been produced, tasks such as face recognition,
verification and clustering can be easily implemented using standard techniques
with FaceNet embeddings as feature vectors.
Our method uses a deep convolutional network trained to directly optimize the
embedding itself, rather than an intermediate bottleneck layer as in previous
deep learning approaches. To train, we use triplets of roughly aligned matching
/ non-matching face patches generated using a novel online triplet mining
method. The benefit of our approach is much greater representational
efficiency: we achieve state-of-the-art face recognition performance using only
128-bytes per face.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system
achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves
95.12%. Our system cuts the error rate in comparison to the best published
result by 30% on both datasets.
We also introduce the concept of harmonic embeddings, and a harmonic triplet
loss, which describe different versions of face embeddings (produced by
different networks) that are compatible to each other and allow for direct
comparison between each other.
more
less

Florian Schroff and Dmitry Kalenichenko and James Philbin

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.CV

[link]
FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative). The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other persons image. ## LMNN Large Margin Nearest Neighbor (LMNN) is learning a pseudo-metric $$d(x, y) = (x -y) M (x -y)^T$$ where $M$ is a positive-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold. ## Curriculum Learning: Triplet selection Show simple examples first, then increase the difficulty. This is done by selecting the triplets. They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low. They want to have $$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$ where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$. ## Tasks * **Face verification**: Is this the same person? * **Face recognition**: Who is this person? ## Datasets * 99.63% accuracy on Labeled FAces in the Wild (LFW) * 95.12% accuracy on YouTube Faces DB ## Network Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14). ## See also * [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma) |

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG

**First published:** 2015/02/11 (6 years ago)

**Abstract:** Training Deep Neural Networks is complicated by the fact that the
distribution of each layer's inputs changes during training, as the parameters
of the previous layers change. This slows down the training by requiring lower
learning rates and careful parameter initialization, and makes it notoriously
hard to train models with saturating nonlinearities. We refer to this
phenomenon as internal covariate shift, and address the problem by normalizing
layer inputs. Our method draws its strength from making normalization a part of
the model architecture and performing the normalization for each training
mini-batch. Batch Normalization allows us to use much higher learning rates and
be less careful about initialization. It also acts as a regularizer, in some
cases eliminating the need for Dropout. Applied to a state-of-the-art image
classification model, Batch Normalization achieves the same accuracy with 14
times fewer training steps, and beats the original model by a significant
margin. Using an ensemble of batch-normalized networks, we improve upon the
best published result on ImageNet classification: reaching 4.9% top-5
validation error (and 4.8% test error), exceeding the accuracy of human raters.
more
less

Sergey Ioffe and Christian Szegedy

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.LG

[link]
### What is BN: * Batch Normalization (BN) is a normalization method/layer for neural networks. * Usually inputs to neural networks are normalized to either the range of [0, 1] or [-1, 1] or to mean=0 and variance=1. The latter is called *Whitening*. * BN essentially performs Whitening to the intermediate layers of the networks. ### How its calculated: * The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and `var(x)` is its variance within a batch. * BN extends that formula further to $x^{**} = gamma * x^* +$ beta, where $x^{**}$ is the final normalized value. `gamma` and `beta` are learned per layer. They make sure that BN can learn the identity function, which is needed in a few cases. * For convolutions, every layer/filter/kernel is normalized on its own (linear layer: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over `N*P*Q` examples (same for the variance). ### Theoretical effects: * BN reduces *Covariate Shift*. That is the change in distribution of activation of a component. By using BN, each neuron's activation becomes (more or less) a gaussian distribution, i.e. its usually not active, sometimes a bit active, rare very active. * Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for gaussian distributions). * BN reduces effects of exploding and vanishing gradients, because every becomes roughly normal distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer and so on. ### Practical effects: * BN reduces training times. (Because of less Covariate Shift, less exploding/vanishing gradients.) * BN reduces demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.) * BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.) * BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.) ![MNIST and neuron activations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Batch_Normalization__performance_and_activations.png?raw=true "MNIST and neuron activations") *BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 15th percentile and the bottom line is the 85th percentile.* ------------------------- ### Rough chapter-wise notes * (2) Towards Reducing Covariate Shift * Batch Normalization (*BN*) is a special normalization method for neural networks. * In neural networks, the inputs to each layer depend on the outputs of all previous layers. * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*. * If the distributions stayed the same, it would simplify the training. Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions). * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network. * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance). * That accomplishes: * No more covariate shift. * Fixes problems with vanishing gradients due to saturation. * Effects: * Networks learn faster. (As they don't have to adjust to covariate shift any more.) * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.) * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.) * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.) * BN reduces the need for dropout. (As it has a regularizing effect.) * How BN works: * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*. * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: Leads to exploding biases while distribution parameters (mean, variance) don't change. * A proper method has to include the current example *and* all previous examples in the normalization step. * This leads to calculating in covariance matrix and its inverse square root. That's expensive. The authors found a faster way. * (3) Normalization via Mini-Batch Statistics * Each feature (component) is normalized individually. (Due to cost, differentiability.) * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))` * Normalizing each component can reduce the expressitivity of nonlinearities. Hence the formula is changed so that it can also learn the identiy function. * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component) * E and Var are estimated for each mini batch. * BN is fully differentiable. Formulas for gradients/backpropagation are at the end of chapter 3 (page 4, left). * (3.1) Training and Inference with Batch-Normalized Networks * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training. * During test time, the BN formulas can be simplified to a single linear transformation. * (3.2) Batch-Normalized Convolutional Networks * Authors recommend to place BN layers after linear/fully-connected layers and before the ninlinearities. * They argue that the linear layers have a better distribution that is more likely to be similar to a gaussian. * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason). * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta). * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of m\*pq, where m is the number of examples in the batch and p q are the feature map dimensions (height, width). BN for linear layers has a batch size of m. * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: Per neuron.) * (3.3) Batch Normalization enables higher learning rates * BN normalizes activations. * Result: Changes to early layers don't amplify towards the end. * BN makes it less likely to get stuck in the saturating parts of nonlinearities. * BN makes training more resilient to parameter scales. * Usually, large learning rates cannot be used as they tend to scale up parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions. * With BN gradients actually go down as parameters increase. Therefore, higher learning rates can be used. * (something about singular values and the Jacobian) * (3.4) Batch Normalization regularizes the model * Usually: Examples are seen on their own by the network. * With BN: Examples are seen in conjunction with other examples (mean, variance). * Result: Network can't easily memorize the examples any more. * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength. * (4) Experiments * (4.1) Activations over time ** They tested BN on MNIST with a 100x100x10 network. (One network with BN before each nonlinearity, another network without BN for comparison.) ** Batch Size was 60. ** The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training. ** Generalization of the BN network seemed to be better. * (4.2) ImageNet classification ** They applied BN to the Inception network. ** Batch Size was 32. ** During training they used (compared to original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation. ** They shuffle the data during training (i.e. each batch contains different examples). ** Depending on the learning rate, they either achieve the same accuracy (as in the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate). ** BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU). ** An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet. * (5) Conclusion ** BN is similar to a normalization layer suggested by Gülcehre and Bengio. However, they applied it to the outputs of nonlinearities. ** They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function). |

Towards Stable and Efficient Training of Verifiably Robust Neural Networks

Zhang, Huan and Chen, Hongge and Xiao, Chaowei and Li, Bo and Boning, Duane S. and Hsieh, Cho-Jui

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

Zhang, Huan and Chen, Hongge and Xiao, Chaowei and Li, Bo and Boning, Duane S. and Hsieh, Cho-Jui

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

[link]
Zhang et al. combine interval bound propagation and CROWN, both approaches to obtain bounds on a network’s output, to efficiently train robust networks. Both interval bound propagation (IBP) and CROWN allow to bound a network’s output for a specific set of allowed perturbations around clean input examples. These bounds can be used for adversarial training. The motivation to combine BROWN and IBP stems from the fact that training using IBP bounds usually results in instabilities, while training with CROWN bounds usually leads to over-regularization. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training

Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael

The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) - 2019 via Local Bibsonomy

Keywords: 3D, Human, estimation, pose

Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael

The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) - 2019 via Local Bibsonomy

Keywords: 3D, Human, estimation, pose

[link]
This paper proposes a 3D human pose estimation in video method based on the dilated temporal convolutions applied on 2D keypoints (input to the network). 2D keypoints can be obtained using any person keypoint detector, but Mask R-CNN with ResNet-101 backbone, pre-trained on COCO and fine-tuned on 2D projections from Human3.6M, is used in the paper. https://i.imgur.com/CdQONiN.png The poses are presented as 2D keypoint coordinates in contrast to using heatmaps (i.e. Gaussian operation applied at the keypoint 2D location). Thus, 1D convolutions over the time series are applied, instead of 2D convolutions over heatmaps. The model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses ( concatenated $(x,y)$ coordinates of the joints in each frame) as input and transforms them through temporal convolutions. https://i.imgur.com/tCZvt6M.png The `Slice` layer in the residual connection performs padding (or slicing) the sequence with replicas of boundary frames (to both left and right) to match the dimensions with the main block as zero-padding is not used in the convolution operations. 3D pose estimation is a difficult task particularly due to the limited data available online. Therefore, the authors propose semi-supervised approach of training the 2D->3D pose estimation by exploiting unlabeled video. Specifically, 2D keypoints are detected in the unlabeled video with any keypoint detector, then 3D keypoints are predicted from them and these 3D points are reprojected back to 2D (camera intrinsic parameters are required). This is idea similar to cycle consistency in the [CycleGAN](https://junyanz.github.io/CycleGAN/), for instance. https://i.imgur.com/CBHxFOd.png In the semi-supervised part (bottom part of the image above) training penalizes when the reprojected 2D keypoints are far from the original input. Weighted mean per-joint position error (WMPJPE) loss, weighted by the inverse of the depth to the object (since far objects should contribute less to the training than close ones) is used as the optimization goal. The two networks (`supervised` above, `semi-supervised` below) have the same architecture but do not share any weights. They are jointly optimized where `semi-supervised` part serves as a regularizer. They communicate through the path aiming to make sure that the mean bone length of the above and below branches match. The interesting tendency is observed from the MPJPE analysis with different amounts of supervised and unsupervised data available. Basically, the `semi-supervised` approach becomes more effective when less labeled data is available. https://i.imgur.com/bHpVcSi.png Additionally, the error is reduced when the ground truth keypoints are used. This means that a robust and accurate 2D keypoint detector is essential for the accurate 3D pose estimation in this setting. https://i.imgur.com/rhhTDfo.png |

Deep Residual Learning for Image Recognition

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

[link]
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. Advantages: * Learning the identity becomes learning 0 which is simpler * Loss in information flow in the forward pass is not a problem anymore * No vanishing / exploding gradient * Identities don't have parameters to be learned ## Evaluation The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128. * ImageNet ILSVRC 2015: 3.57% (ensemble) * CIFAR-10: 6.43% * MS COCO: 59.0% mAp@0.5 (ensemble) * PASCAL VOC 2007: 85.6% mAp@0.5 * PASCAL VOC 2012: 83.8% mAp@0.5 ## See also * [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993) |

About