[link]
## General stuff about face recognition

Face recognition has four main tasks:

* **Face detection**: Given an image, draw a rectangle around every face.
* **Face alignment**: Transform a face into a canonical pose.
* **Face representation**: Find a representation of a face that is suitable for follow-up tasks (small size, computationally cheap to compare, invariant to irrelevant changes).
* **Face verification**: Given images of two faces, decide whether they show the same person.

The face verification task is sometimes replaced by the (simpler) face classification task: given a face, decide which person from a fixed set it is.

Datasets being used are:

* **LFW** (Labeled Faces in the Wild): 13,233 web photos of 5,749 celebrities; DeepFace reaches 97.35% accuracy here.
* **YTF** (YouTube Faces): 3,425 YouTube videos of 1,595 subjects.
* **SFC** (Social Face Classification): 4.4 million labeled faces of 4,030 people, with 800 to 1,200 faces each.
* **USF** (Human-ID database): 3D scans of faces.

## Ideas in this paper

This paper deals with face alignment and face representation.

**Face Alignment**

They made an average face model from the USF dataset. Then, for each new face, they apply the following procedure:

* Find 6 points in the face (2 eyes, 1 nose tip, 2 corners of the mouth, 1 middle point of the bottom lip).
* Crop the face according to those points.
* Find 67 points in the face and map them onto a normalized 3D face model.
* Transform (= align) the face into a normalized position.

**Representation**

Train a neural network on 152x152 face images to classify the 4,030 SFC celebrities. Remove the softmax output layer and use the output of the second-to-last layer as the face representation. The network is (a rough code sketch of this architecture is given after this summary):

* C1 (convolution): 32 filters of size $11 \times 11 \times 3$ (RGB channels); returns $142 \times 142$ "images"
* M2 (max pooling): $3 \times 3$, stride 2; returns $71 \times 71$ "images"
* C3 (convolution): 16 filters of size $9 \times 9 \times 16$; returns $63 \times 63$ "images"
* L4 (locally connected): $16 \times 9 \times 9 \times 16$; returns $55 \times 55$ "images"
* L5 (locally connected): $16 \times 7 \times 7 \times 16$; returns $25 \times 25$ "images"
* L6 (locally connected): $16 \times 5 \times 5 \times 16$; returns $21 \times 21$ "images"
* F7 (fully connected): ReLU, 4096 units
* F8 (fully connected): softmax layer with 4030 output neurons

The training was done with:

* Stochastic Gradient Descent (SGD)
* Momentum of 0.9
* Performance scheduling (learning rate starting at 0.01, ending at 0.0001)
* Weight initialization: $w \sim \mathcal{N}(\mu=0, \sigma=0.01)$, $b = 0.5$
* ~15 epochs ($\approx$ 3 days) of training

## Evaluation results

* **Quality**:
  * 97.35% accuracy (or mean accuracy?) with an ensemble of DNNs on LFW
  * 91.4% accuracy with a single network on YTF
* **Speed**: DeepFace runs in 0.33 seconds per image (I'm not sure at which size). This includes image decoding, face detection and alignment, **the** feed-forward network (why only one? wasn't the ensemble the best-performing model?) and the final classification output.

## See also

* Andrew Ng: [C4W4L03 Siamese Network](https://www.youtube.com/watch?v=6jfw8MuKwpI)
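To make the layer list above concrete, here is a rough PyTorch sketch of such a network. Caveats: the locally connected layers L4-L6 are stood in for by ordinary convolutions (PyTorch has no built-in locally connected layer), and the stride of 2 in L5 is my assumption so that the quoted 25x25 output size works out. Treat it as an illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class DeepFaceSketch(nn.Module):
    def __init__(self, n_classes=4030):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=11), nn.ReLU(),            # C1: 152 -> 142
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),  # M2: 142 -> 71
            nn.Conv2d(32, 16, kernel_size=9), nn.ReLU(),            # C3: 71 -> 63
            nn.Conv2d(16, 16, kernel_size=9), nn.ReLU(),            # L4 (stand-in): 63 -> 55
            nn.Conv2d(16, 16, kernel_size=7, stride=2), nn.ReLU(),  # L5 (stand-in): 55 -> 25
            nn.Conv2d(16, 16, kernel_size=5), nn.ReLU(),            # L6 (stand-in): 25 -> 21
        )
        self.representation = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 21 * 21, 4096), nn.ReLU(),               # F7: used as the face descriptor
        )
        self.classifier = nn.Linear(4096, n_classes)                # F8: softmax over 4030 identities

    def forward(self, x):
        f = self.representation(self.features(x))
        return self.classifier(f)                                   # logits, only needed for training

model = DeepFaceSketch()
logits = model(torch.randn(1, 3, 152, 152))   # -> shape (1, 4030)
```

At verification time the classifier head would be dropped and only the 4096-dimensional F7 output kept, as described above.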
[link]
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This approach to image generation produces images that are hard to distinguish from the training data.

#### What is DRAW:

The Deep Recurrent Attentive Writer (DRAW) model differs from other variational auto-encoders in two ways. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region modified by the decoder. (A minimal sketch of this sequential loop is given after this summary.)

#### What do we gain?

The quality of the resulting images is greatly improved by allowing conditional, sequential generation. In addition, the spatial attention mechanism can be reused in other contexts to address the "where to look?" problem.

#### What follows?

A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since the attention mechanism already restricts the input of the network.

#### Like:

* As observed in the samples generated by the model, the attention mechanism works effectively, reconstructing images in a local, part-by-part way.
* The attention model is fully differentiable.

#### Dislike:

* I think a better exposition of the attention mechanism would improve this paper.
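As a concrete illustration of the sequential read/write loop, here is a minimal PyTorch sketch of a DRAW-like model. It omits the spatial attention mechanism (the encoder reads the whole image and the decoder writes additively to the full canvas), and all layer sizes and the step count are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class MiniDRAW(nn.Module):
    """At each step the encoder reads the image plus the current reconstruction
    error, a latent is sampled, and the decoder writes an additive canvas update."""

    def __init__(self, img_dim=28 * 28, enc_dim=256, dec_dim=256, z_dim=10, steps=8):
        super().__init__()
        self.img_dim, self.enc_dim, self.dec_dim, self.steps = img_dim, enc_dim, dec_dim, steps
        self.encoder = nn.LSTMCell(2 * img_dim + dec_dim, enc_dim)
        self.decoder = nn.LSTMCell(z_dim, dec_dim)
        self.to_mu = nn.Linear(enc_dim, z_dim)
        self.to_logvar = nn.Linear(enc_dim, z_dim)
        self.write = nn.Linear(dec_dim, img_dim)

    def forward(self, x):
        B = x.size(0)
        canvas = torch.zeros(B, self.img_dim, device=x.device)
        h_enc = c_enc = torch.zeros(B, self.enc_dim, device=x.device)
        h_dec = c_dec = torch.zeros(B, self.dec_dim, device=x.device)
        kl = torch.zeros(B, device=x.device)
        for _ in range(self.steps):
            err = x - torch.sigmoid(canvas)                            # "read": current reconstruction error
            h_enc, c_enc = self.encoder(torch.cat([x, err, h_dec], dim=1), (h_enc, c_enc))
            mu, logvar = self.to_mu(h_enc), self.to_logvar(h_enc)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterisation trick
            kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1)
            h_dec, c_dec = self.decoder(z, (h_dec, c_dec))
            canvas = canvas + self.write(h_dec)                        # "write": additive canvas update
        return torch.sigmoid(canvas), kl.mean()

model = MiniDRAW()
x = torch.rand(16, 28 * 28)                                            # e.g. flattened MNIST-sized images
recon, kl = model(x)
loss = nn.functional.binary_cross_entropy(recon, x, reduction='sum') / 16 + kl
```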
[link]
This paper takes a face image and changes its pose, i.e. rotates the face to any desired pose, by passing the target pose as an input to the model.

https://i.imgur.com/AGNOag5.png

They use a GAN (named DR-GAN) for face rotation. The generator has an encoder and a decoder. The encoder takes the image and produces a high-level feature representation. The decoder takes these high-level features, the target pose, and some noise, and generates the output image with the rotated face. The generated image is then passed to a discriminator, which decides whether the image is real or fake. The discriminator also has two other outputs: 1) it estimates the pose of the generated image, and 2) it estimates the identity of the person.

No direct loss is applied to the generator; it is trained by the gradients it receives through the discriminator to minimize three objectives: 1) the GAN loss (to fool the discriminator), 2) the pose-estimation loss, and 3) the identity-estimation loss. (A rough sketch of this setup is given after this summary.)

They use two tricks to improve the model: 1) using the same parameters for the generator's encoder (gen-enc) and the discriminator (they observe this helps identity recognition), and 2) passing two images to gen-enc, interpolating between their high-level features (the gen-enc outputs), and then applying two losses to the result: a GAN loss and a pose loss, both applied through the discriminator as above. The first trick improves gen-enc and the second trick improves the generator's decoder (gen-dec); both help identification.

Their model can also leverage multiple images of the same identity, if the dataset provides them, to obtain a better latent representation in gen-enc for that identity.

https://i.imgur.com/23Tckqc.png

These are some samples of face frontalization:

https://i.imgur.com/zmCODXe.png

And these are some samples of interpolating different features in latent space: (sub-fig a) interpolating f(x) between the latent representations of two images, (sub-fig b) interpolating the pose, (sub-fig c) interpolating the noise:

https://i.imgur.com/KlkVyp9.png

I find these positive aspects about the paper: 1) face rotation is applied to images in the wild, 2) paired data is not required, 3) multiple source images of the same identity can be used if available, 4) identity and pose are used cleverly in the discriminator to guide the generator, 5) the model can specify the target pose (it is not limited to face frontalization).

Negative aspects: 1) the faces have many artifacts, similar to those of some other GAN models, 2) the identity is not always well preserved and the faces sometimes look distorted compared to the original person.

They evaluate the model on identity recognition and face rotation and demonstrate compelling results.
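A hedged PyTorch sketch of the overall setup described above: an encoder-decoder generator conditioned on a target pose code and noise, and a discriminator with three heads (real/fake, identity, pose). The resolutions, channel counts, and dimensions are toy values chosen for illustration, not the paper's.

```python
import torch
import torch.nn as nn

N_ID, N_POSE, Z_DIM, F_DIM = 100, 9, 50, 320   # identities, pose codes, noise dim, feature dim (all assumed)

class Generator(nn.Module):
    """Encode the input face to a feature f(x), then decode f(x) concatenated
    with a target-pose code and noise into the rotated face."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),        # 32x32 -> 16x16 (toy resolution)
            nn.Conv2d(32, F_DIM, 4, 2, 1), nn.ReLU(),    # 16x16 -> 8x8
            nn.AdaptiveAvgPool2d(1), nn.Flatten())       # -> (B, F_DIM)
        self.fc = nn.Linear(F_DIM + N_POSE + Z_DIM, 32 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),   # 8x8 -> 16x16
            nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Tanh())    # 16x16 -> 32x32

    def forward(self, x, pose, z):
        f = self.enc(x)                                   # identity-bearing latent feature
        h = self.fc(torch.cat([f, pose, z], dim=1))
        return self.dec(h.view(-1, 32, 8, 8))

class Discriminator(nn.Module):
    """Shared body with three heads; the generator is trained only through
    the gradients flowing back from these heads."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, F_DIM, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.real_fake = nn.Linear(F_DIM, 1)              # GAN head
        self.identity = nn.Linear(F_DIM, N_ID)            # who is it?
        self.pose = nn.Linear(F_DIM, N_POSE)              # which pose?

    def forward(self, img):
        h = self.body(img)
        return self.real_fake(h), self.identity(h), self.pose(h)

# Toy forward pass (32x32 images, one-hot pose code):
x = torch.randn(4, 3, 32, 32)
pose = torch.eye(N_POSE)[torch.randint(N_POSE, (4,))]
z = torch.randn(4, Z_DIM)
fake = Generator()(x, pose, z)
rf, idt, ps = Discriminator()(fake)
```

The generator loss would then combine cross-entropy terms on `idt` and `ps` (target identity and pose) with the adversarial term on `rf`, matching the three objectives listed above.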
[link]
This paper describes a class of algorithms for classification or regression in the on-line setting. That is, the data is a sequence of pairs $(X_t, Y_t)$ (where $X_t$ may be a vector), and these items arrive in some order: the algorithm must predict each $\hat{Y}_t$ using only $X_t$ and the previously seen pairs. In the regression setting, each mis-prediction incurs a loss like $(Y_t - \hat{Y}_t)^2$; in the classification setting, $Y_t$ is always 0 or 1 and the loss is $|Y_t - \hat{Y}_t|$. Roughly, the algorithm makes linear predictions using an internal weight vector ($\hat{Y}_t = w \cdot X_t$) and performs a gradient-descent-like weight update. However, it tries to keep the q-norm (q can be any number) of the weight vector small, preventing the weights themselves from becoming too large. The algorithm itself is simple, and the weight update takes advantage of link functions, which the author defines. The majority of the paper is devoted to deriving loss bounds, showing that the loss incurred by this algorithm isn't much worse than that incurred by the best weight vector chosen in hindsight. Typical readers will be interested in the first few pages, as the latter part of the paper consists mainly of technical proofs.
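A minimal sketch of the online regression protocol described above, using a plain gradient step for the squared loss. The paper's algorithm additionally routes the update through a link function to keep the q-norm of the weight vector small; that machinery is omitted here.

```python
import numpy as np

def online_regression(stream, dim, lr=0.1):
    """Generic online protocol: see x_t, predict, see y_t, incur loss, update."""
    w = np.zeros(dim)
    total_loss = 0.0
    for x_t, y_t in stream:
        y_hat = w @ x_t                       # linear prediction with the current weights
        total_loss += (y_t - y_hat) ** 2      # squared loss (regression setting)
        w += lr * (y_t - y_hat) * x_t         # gradient-descent-like update
    return w, total_loss

# Toy stream: y is a noisy linear function of x
rng = np.random.default_rng(0)
w_star = rng.normal(size=5)
stream = [(x, x @ w_star + 0.01 * rng.normal()) for x in rng.normal(size=(200, 5))]
print(online_regression(stream, dim=5))
```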
[link]
### Introduction

* *Curriculum Learning* - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks.
* The motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
* [Link](http://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf) to the paper.

### Contributions of the paper

* Explore cases that show that curriculum learning benefits machine learning.
* Offer hypotheses about when and why it happens.
* Explore the relation of curriculum learning to other machine learning approaches.

### Experiments with convex criteria

* Train a perceptron where some input data is irrelevant (not predictive of the target class).
* Difficulty can be defined in terms of the number of irrelevant samples or the margin from the separating hyperplane.
* The curriculum learning model outperforms the no-curriculum approach.
* Surprisingly, when difficulty is defined in terms of the number of irrelevant examples, the anti-curriculum strategy also outperforms the no-curriculum strategy.

### Experiments on shape recognition with datasets having different variability in shapes

* Standard (target) dataset - images of rectangles, ellipses, and triangles.
* Easy dataset - images of squares, circles, and equilateral triangles.
* Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the *switch epoch*); a small code sketch of this scheme is given below.
* For no-curriculum learning, the first epoch is the *switch epoch*.
* As the *switch epoch* increases, the classification error comes down, with the best performance when the *switch epoch* is half the total number of epochs.
* The paper does not report results for higher values of the *switch epoch*.

### Experiments on language modeling

* The standard dataset is the set of all possible windows of size 5 over the text of Wikipedia where all words in the window appear among the 20,000 most frequent words.
* The easy dataset considers only those windows where all words appear among the 5,000 most frequent words in the vocabulary.
* Each word in the vocabulary is embedded into a *d*-dimensional feature space using a matrix **W** (to be learnt).
* The model predicts the score of the next word, given a window of words.
* The expected value of a ranking loss function is minimized to learn **W**.
* The curriculum-based model overtakes the other model soon after switching to the target vocabulary, indicating that it quickly learns the new words.

### Curriculum as a continuation method

* Continuation methods start with a smoothed objective function and gradually move to a less smoothed function.
* Useful when the objective function is non-convex.
* Consider a family of cost functions $C_\lambda(\theta)$ such that $C_0(\theta)$ can be easily optimised and $C_1(\theta)$ is the actual objective function.
* Start with $C_0(\theta)$ and increase $\lambda$, keeping $\theta$ at a local minimum of $C_\lambda(\theta)$.
* The idea is to move $\theta$ towards a dominant (if not global) minimum of $C_1(\theta)$.
* Curriculum learning can be seen as a sequence of training criteria, starting with an easy-to-optimise objective and moving all the way to the actual objective.
* The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (modelling the probability of selecting any one training example at any step).
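A small sketch of the switch-epoch curriculum from the shape-recognition experiment above; the names (`model`, `easy_loader`, `target_loader`, `train_one_epoch`) are generic placeholders, not from the paper.

```python
def curriculum_training(model, easy_loader, target_loader, train_one_epoch,
                        total_epochs=128, switch_epoch=64):
    """Train on the easy dataset first, then switch to the target dataset.

    switch_epoch=0 recovers no-curriculum training; the paper reports the best
    results when the switch happens around half the total number of epochs.
    """
    for epoch in range(total_epochs):
        loader = easy_loader if epoch < switch_epoch else target_loader
        train_one_epoch(model, loader)
    return model
```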
### Advantages of Curriculum Learning

* Faster training in the online setting, as the learner does not try to learn difficult examples when it is not ready for them.
* Guiding training towards better local minima in parameter space, which is especially useful for non-convex methods.

### Relation to other machine learning approaches

* **Unsupervised preprocessing** - Both have a regularizing effect and lower the generalization error for the same training error.
* **Active learning** - The learner benefits most from examples that are close to its current frontier of knowledge, neither too hard nor too easy.
* **Boosting algorithms** - Difficult examples are gradually emphasised, much like a curriculum that starts with a focus on easier examples, though in boosting the training criterion itself does not change.
* **Transfer learning** and **life-long learning** - Initial tasks are used to guide the optimisation problem.

### Criticism

* Curriculum learning is not well understood, which makes it difficult to define the curriculum.
* In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.