[link]
_Objective:_ Design feed-forward neural networks (fully connected) that can be trained even with very deep architectures. * _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/), [CIFAR10](https://www.cs.toronto.edu/%7Ekriz/cifar.html), [Tox21](https://tripod.nih.gov/tox21/challenge/) and [UCI tasks](https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits). * _Code:_ [here](https://github.com/bioinf-jku/SNNs) ## Inner-workings: They introduce a new activation function, the Scaled Exponential Linear Unit (SELU), which has the nice property of making neuron activations converge to a fixed point with zero mean and unit variance. They also derive upper and lower bounds on the mean and variance under very mild conditions, which basically means that there will be no exploding or vanishing gradients. The activation function is: [![screen shot 2017-06-14 at 11 38 27 am](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png)](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png) With specific parameters for alpha and lambda to ensure the previous properties. The reference implementation (NumPy) is:

```python
def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x >= 0.0, x, alpha * np.exp(x) - alpha)
```

They also introduce a new dropout (alpha-dropout) to compensate for the fact that standard dropout would break the self-normalizing property: [![screen shot 2017-06-14 at 11 44 42 am](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png)](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png) ## Results: Batch norm becomes obsolete and they are also able to train deeper architectures. This becomes a good choice to replace shallow architectures where random forests or SVMs used to give the best results. They outperform most other techniques on small datasets. [![screen shot 2017-06-14 at 11 36 30 am](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png)](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png) Might become a new standard for fully-connected activations in the future. |
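To illustrate the self-normalizing property, here is a minimal NumPy sketch (not from the paper's code) that pushes standardized random inputs through a stack of fully connected SELU layers with the recommended zero-mean, variance-1/n weight initialization; the activation mean and variance should stay close to 0 and 1 at any depth.

```python
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x >= 0.0, x, alpha * np.exp(x) - alpha)

rng = np.random.RandomState(0)
x = rng.randn(1024, 512)                    # standardized inputs: zero mean, unit variance

for depth in range(32):
    w = rng.randn(512, 512) / np.sqrt(512)  # weights with zero mean and variance 1/n
    x = selu(x @ w)

print(x.mean(), x.var())                    # stays close to (0, 1), even after 32 layers
```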
[link]
_Objective:_ Analyze a large-scale dataset of fashion images to discover visually consistent style clusters. * _Dataset:_ StreetStyle-27K. * _Code:_ demo [here](http://streetstyle.cs.cornell.edu/) ## New dataset: StreetStyle-27K 1. **Photos (100 million)**: from Instagram using the [API](https://www.instagram.com/developer/) to retrieve images with the correct location and time. 2. **People (14.5 million)**: they run two algorithms to normalize the body position in the image: * [Face++](http://www.faceplusplus.com/) to detect and localize faces. * [Deformable Part Model](http://people.cs.uchicago.edu/%7Erbg/latent-release5/) to estimate the visibility of the rest of the body. 3. **Clothing annotations (27K)**: Amazon Mechanical Turk with quality control. $4,000 for the whole dataset. ## Architecture: Usual GoogLeNet but they use [Isotonic Regression](http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/) to correct the bias. ## Unsupervised clustering: They proceed as follows: 1. Compute the feature embeddings for a subset of the overall dataset selected to represent location and time. 2. Apply L2 normalization. 3. Use PCA to find the components representing 90% of the variance (165 here). 4. Cluster them using a [GMM](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) with 400 mixtures, which represent the clusters. They compute fashion clusters per city or for larger regions: [![screen shot 2017-06-15 at 12 04 06 pm](https://user-images.githubusercontent.com/17261080/27176447-d33fc2dc-51c2-11e7-9191-dbf972ee96a1.png)](https://user-images.githubusercontent.com/17261080/27176447-d33fc2dc-51c2-11e7-9191-dbf972ee96a1.png) ## Results: Pretty standard techniques but all patched together to produce interesting visualizations. |
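A rough scikit-learn sketch of the clustering stage (steps 2-4 above), assuming `features` holds the GoogLeNet embeddings of the selected subset; the 90% variance threshold and the 400 mixtures are the values quoted above, while the diagonal covariance is an assumption.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

features = np.random.rand(10000, 1024)        # placeholder for the CNN embeddings

x = normalize(features, norm='l2')            # step 2: L2 normalization
x = PCA(n_components=0.90).fit_transform(x)   # step 3: keep 90% of the variance (~165 dims)
gmm = GaussianMixture(n_components=400, covariance_type='diag').fit(x)  # step 4
style_clusters = gmm.predict(x)               # each mixture component is a style cluster
```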
[link]
_Objective:_ Transfer visual attributes (color, tone, texture, style, etc.) between two semantically-meaningful images such as a picture and a sketch. ## Inner workings: ### Image analogy An image analogy A:A′::B:B′ is a relation where: * B′ relates to B in the same way as A′ relates to A * A and A′ are in pixel-wise correspondences * B and B′ are in pixel-wise correspondences In this paper only a source image A and an example image B′ are given, and both A′ and B represent latent images to be estimated. [![screen shot 2017-05-18 at 10 43 48 am](https://cloud.githubusercontent.com/assets/17261080/26193907/f080e212-3bb6-11e7-9441-7b255e4219f5.png)](https://cloud.githubusercontent.com/assets/17261080/26193907/f080e212-3bb6-11e7-9441-7b255e4219f5.png) ### Dense correspondence In order to find dense correspondences between two images they use features from a pre-trained CNN (VGG-19), retrieving all the ReLU layers. The mapping is divided into two sub-mappings that are easier to compute, first a visual attribute transformation and then a space transformation. [![screen shot 2017-05-18 at 11 04 58 am](https://cloud.githubusercontent.com/assets/17261080/26194835/03ccd94a-3bba-11e7-93ca-9420d4d96162.png)](https://cloud.githubusercontent.com/assets/17261080/26194835/03ccd94a-3bba-11e7-93ca-9420d4d96162.png) ## Architecture: The algorithm proceeds as follows: 1. Compute features at each layer for the input image using a pre-trained CNN and initialize feature maps of latent images with the coarsest layer. 2. For said layer compute a forward and reverse nearest-neighbor field (NNF, basically an offset field). 3. Use this NNF with the features of the current layer of the input to compute the features of the latent images. 4. Upsample the NNF and use it as the initialization for the NNF of the next layer. [![screen shot 2017-05-18 at 11 14 33 am](https://cloud.githubusercontent.com/assets/17261080/26195178/35277e0e-3bbb-11e7-82ce-037466314640.png)](https://cloud.githubusercontent.com/assets/17261080/26195178/35277e0e-3bbb-11e7-82ce-037466314640.png) ## Results: Impressive quality on all types of visual transfer but very slow (~3 min on a GPU for one image). [![screen shot 2017-05-18 at 11 36 47 am](https://cloud.githubusercontent.com/assets/17261080/26196151/54ef423c-3bbe-11e7-9433-b29be5091fae.png)](https://cloud.githubusercontent.com/assets/17261080/26196151/54ef423c-3bbe-11e7-9433-b29be5091fae.png) |
[link]
Generate code from a UI screenshot. _Code:_ [Demo](https://youtu.be/pqKeXkhFA3I) and [code](https://github.com/tonybeltramelli/pix2code) to come. ## Inner-workings: They decompose the problem into three steps: 1. a computer vision problem of understanding the given scene and inferring the objects present, their identities, positions, and poses. 2. a language modeling problem of understanding computer code and generating syntactically and semantically correct samples. 3. use the solutions to both previous sub-problems by exploiting the latent variables inferred from scene understanding to generate corresponding textual descriptions of the objects represented by these variables. They also introduce a Domain Specific Language (DSL) for modeling purposes. ## Architecture: * Vision model: usual AlexNet-like architecture * Language model: uses one-hot encoding for the words in the DSL vocabulary, which is then fed into an LSTM * Combined model: an LSTM too. [![screen shot 2017-06-16 at 11 34 28 am](https://user-images.githubusercontent.com/17261080/27221124-c9cadcc6-5287-11e7-9d38-c4234af92912.png)](https://user-images.githubusercontent.com/17261080/27221124-c9cadcc6-5287-11e7-9d38-c4234af92912.png) ## Results: Clearly not ready for any serious use but promising results! [![screen shot 2017-06-16 at 11 57 45 am](https://user-images.githubusercontent.com/17261080/27222031-0bf8e7de-528b-11e7-896f-cdb410f928c3.png)](https://user-images.githubusercontent.com/17261080/27222031-0bf8e7de-528b-11e7-896f-cdb410f928c3.png) |
[link]
_Objective:_ Develop a platform to make AI accessible. _Website:_ [here](http://pennai.org/) ## Inner-workings: A platform for AI with deep learning and genetic programming, more focused on biology. ## Architecture: [![screen shot 2017-06-26 at 11 00 07 am](https://user-images.githubusercontent.com/17261080/27690782-8b71f8c8-5ce2-11e7-9d84-77a4dd519e18.jpg)](https://user-images.githubusercontent.com/17261080/27690782-8b71f8c8-5ce2-11e7-9d84-77a4dd519e18.jpg) ## Results: Just announced, keep an eye on it. |
[link]
_Objective:_ Perform domain adaptation by adapting several layers using a randomized representation, not just the final layer, thus aligning the joint distribution and not just the marginals. _Dataset:_ [Office](https://cs.stanford.edu/%7Ejhoffman/domainadapt/) and [ImageCLEF-DA1](http://imageclef.org/2014/adaptation). ## Inner-workings: Basically an improvement on [RevGrad](https://arxiv.org/pdf/1505.07818.pdf) where instead of using only the last embedding layer for the discriminator, several of them are used. To avoid dimension explosion when using the tensor product of all layers they instead use a randomized multi-linear representation: [![screen shot 2017-06-01 at 5 35 46 pm](https://cloud.githubusercontent.com/assets/17261080/26687736/cff20446-46f0-11e7-918e-b60baa10aa67.png)](https://cloud.githubusercontent.com/assets/17261080/26687736/cff20446-46f0-11e7-918e-b60baa10aa67.png) Where: * d is the dimension of the embedding (they use 1024) * R is a random matrix whose elements have zero mean and unit variance (Bernoulli, Gaussian and Uniform are tried) * z^l is the l-th layer * ⊙ represents the Hadamard product In practice they don't use all layers but just the last 3-4 layers for ResNet and AlexNet. ## Architecture: [![screen shot 2017-06-01 at 5 34 44 pm](https://cloud.githubusercontent.com/assets/17261080/26687686/acce0d98-46f0-11e7-89d1-15452cbb527e.png)](https://cloud.githubusercontent.com/assets/17261080/26687686/acce0d98-46f0-11e7-89d1-15452cbb527e.png) They use the usual losses for domain adaptation with: - F minimizing the cross-entropy loss for classification and trying to reduce the gap between the distributions (indicated by D). - D maximizing the gap between the distributions. [![screen shot 2017-06-01 at 5 40 53 pm](https://cloud.githubusercontent.com/assets/17261080/26687936/8575ff70-46f1-11e7-917d-05129ab190b0.png)](https://cloud.githubusercontent.com/assets/17261080/26687936/8575ff70-46f1-11e7-917d-05129ab190b0.png) ## Results: Improvement on state-of-the-art results for most tasks in the dataset, very easy to implement with any pre-trained network out of the box. |
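A small NumPy sketch of the randomized multi-linear fusion described above: each chosen layer output z^l is projected by a fixed random matrix R_l and the projections are combined with a Hadamard product; the 1/sqrt(d) scaling follows the screenshot formula and should be treated as an assumption here.

```python
import numpy as np

def randomized_multilinear(layer_outputs, d=1024, seed=0):
    """Fuse several layer outputs (1-D arrays) into one d-dimensional representation.
    Each R_l has zero-mean, unit-variance entries (Gaussian here; Bernoulli and
    Uniform are the other options tried in the paper) and is fixed, not learned."""
    rng = np.random.RandomState(seed)
    fused = np.ones(d)
    for z in layer_outputs:
        R = rng.randn(d, z.shape[0])
        fused *= R @ z                 # Hadamard product of the random projections
    return fused / np.sqrt(d)          # scaling assumed from the formula above

fused = randomized_multilinear([np.random.rand(256), np.random.rand(10)])
```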
[link]
_Objective:_ Replace the usual GAN loss with a softmax cross-entropy loss to stabilize GAN training. _Dataset:_ [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) ## Inner working: Linked to recent work such as WGAN or Loss-Sensitive GAN that focus on objective functions with non-vanishing gradients to avoid the situation where the discriminator `D` becomes too good and the gradient vanishes. Thus they first introduce two targets for the discriminator `D` and the generator `G`: [![screen shot 2017-04-24 at 6 18 11 pm](https://cloud.githubusercontent.com/assets/17261080/25347232/767049bc-291a-11e7-906e-c19a92bb7431.png)](https://cloud.githubusercontent.com/assets/17261080/25347232/767049bc-291a-11e7-906e-c19a92bb7431.png) [![screen shot 2017-04-24 at 6 18 24 pm](https://cloud.githubusercontent.com/assets/17261080/25347233/7670ff60-291a-11e7-974f-83eb9269d238.png)](https://cloud.githubusercontent.com/assets/17261080/25347233/7670ff60-291a-11e7-974f-83eb9269d238.png) And then the two new losses: [![screen shot 2017-04-24 at 6 19 50 pm](https://cloud.githubusercontent.com/assets/17261080/25347275/a303aa0a-291a-11e7-86b4-abd42c83d4a8.png)](https://cloud.githubusercontent.com/assets/17261080/25347275/a303aa0a-291a-11e7-86b4-abd42c83d4a8.png) [![screen shot 2017-04-24 at 6 19 55 pm](https://cloud.githubusercontent.com/assets/17261080/25347276/a307bc6c-291a-11e7-98b3-cbd7182090cd.png)](https://cloud.githubusercontent.com/assets/17261080/25347276/a307bc6c-291a-11e7-98b3-cbd7182090cd.png) ## Architecture: They use the DCGAN architecture and simply change the loss and remove the batch normalization and other empirical techniques used to stabilize training. They show that training the softmax GAN remains robust nonetheless. |
[link]
_Objective:_ Use a GAN to learn an embedding invariant to domain shift. _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/), [SVHN](http://ufldl.stanford.edu/housenumbers/), USPS, [OFFICE](https://cs.stanford.edu/%7Ejhoffman/domainadapt/) and [CFP](http://mukh.com/). ## Architecture: The total network is composed of several sub-networks: 1. `F`, the Feature embedding network that takes as input an image from either the source or target dataset and generates a feature vector. 2. `C`, the Classifier network, used when the image comes from the source dataset. 3. `G`, the Generative network that learns to generate an image similar to the source dataset using an image embedding from `F` and a random noise vector. 4. `D`, the Discriminator network that tries to guess whether an image comes from the source dataset or the generative network. `G` and `D` play a minimax game where `D` tries to classify the generated samples as fake and `G` tries to fool `D` by producing examples that are as realistic as possible. The scheme for training the network is the following: [![screen shot 2017-04-14 at 5 50 22 pm](https://cloud.githubusercontent.com/assets/17261080/25048122/f2a648b6-213a-11e7-93bd-954981bd3838.png)](https://cloud.githubusercontent.com/assets/17261080/25048122/f2a648b6-213a-11e7-93bd-954981bd3838.png) ## Results: Very interesting: the generated image is just a side-product, but the overall approach seems to be the state of the art at the time of writing (the paper was published one week ago). |
[link]
_Objective:_ Reduce learning time for [DQN](https://deepmind.com/research/dqn/)-type architectures. They introduce a new network element, called the DND (Differentiable Neural Dictionary), which is basically a dictionary that accepts any key (especially embeddings) and computes values using a kernel between keys. Plus it's differentiable. ## Architecture: The network basically works in two steps: 1. A classical CNN that computes an embedding for every image. 2. A DND for each possible action (controller input) that stores the embedding as the key and the estimated reward as the value. They also use a buffer to store all tuples (previous image, action, reward, next image), and training uses standard techniques. [![screen shot 2017-04-12 at 11 23 32 am](https://cloud.githubusercontent.com/assets/17261080/24951103/92930022-1f73-11e7-97d2-628e2f4b5a33.png)](https://cloud.githubusercontent.com/assets/17261080/24951103/92930022-1f73-11e7-97d2-628e2f4b5a33.png) ## Results: Clearly improves learning speed but in the end other techniques catch up and it gets outperformed. |
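A toy NumPy sketch of a DND-style lookup, assuming an inverse-distance kernel between keys (as used in the Neural Episodic Control paper); reading returns a kernel-weighted average of the stored values, which is differentiable with respect to the query.

```python
import numpy as np

class DND:
    """Toy differentiable neural dictionary: keys are embeddings, values are scalar returns."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def read(self, query, delta=1e-3):
        keys = np.stack(self.keys)
        k = 1.0 / (np.sum((keys - query) ** 2, axis=1) + delta)  # inverse-distance kernel
        w = k / k.sum()                                          # normalized kernel weights
        return float(w @ np.array(self.values))                  # kernel-weighted value estimate

dnd = DND()
dnd.write(np.array([0.0, 1.0]), 1.0)
dnd.write(np.array([1.0, 0.0]), 0.0)
print(dnd.read(np.array([0.1, 0.9])))  # close to 1.0
```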
[link]
_Objective:_ Image segmentation and pose estimation with an extension of Faster R-CNN. _Dataset:_ [COCO](http://mscoco.org/) and [Cityscapes](https://www.cityscapes-dataset.com/). ## Inner workings: The core operator of Faster R-CNN is _RoIPool_, which performs coarse spatial quantization for feature extraction but introduces misalignments for pixel-to-pixel comparison, which is exactly what segmentation requires. The paper introduces a new layer, _RoIAlign_, that faithfully preserves exact spatial locations. One important point is that mask and class prediction are decoupled: a mask is proposed for each class without competition, and the class predictor finally elects the winner. ## Architecture: Based on Faster R-CNN but with an added mask subnetwork that computes a segmentation mask for each class. Different feature extractors and proposers are tried, see two examples below: [![screen shot 2017-05-22 at 7 25 04 pm](https://cloud.githubusercontent.com/assets/17261080/26320765/659bfd6e-3f24-11e7-9184-393e83e9108d.png)](https://cloud.githubusercontent.com/assets/17261080/26320765/659bfd6e-3f24-11e7-9184-393e83e9108d.png) ## Results: Runs at about 200ms per frame on a GPU for segmentation (2 days of training on a single 8-GPU machine) and 5 fps for pose estimation. Very impressive segmentation and pose estimation: [![screen shot 2017-05-22 at 7 26 57 pm 1](https://cloud.githubusercontent.com/assets/17261080/26320824/a9a0909c-3f24-11e7-8e06-b2f132aad2d7.png)](https://cloud.githubusercontent.com/assets/17261080/26320824/a9a0909c-3f24-11e7-8e06-b2f132aad2d7.png) [![screen shot 2017-05-22 at 7 29 26 pm](https://cloud.githubusercontent.com/assets/17261080/26320929/08b71c4a-3f25-11e7-8eb5-959ceb7b6112.png)](https://cloud.githubusercontent.com/assets/17261080/26320929/08b71c4a-3f25-11e7-8eb5-959ceb7b6112.png) |
[link]
_Objective:_ Improve GAN convergence towards more diverse and visually pleasing images at higher resolution using a novel equilibrium method between the discriminator and the generator that also simplifies training procedures. _Dataset:_ [LFW](http://vis-www.cs.umass.edu/lfw/) ## Inner workings: They try to match the distribution of the errors (assumed to be normally distributed) instead of matching the distribution of the samples directly. In order to do this they compute the Wasserstein distance between the pixel-wise autoencoder loss distributions of real and generated samples, defined as follows: 1. Autoencoder loss: [![screen shot 2017-04-24 at 3 46 32 pm](https://cloud.githubusercontent.com/assets/17261080/25340190/429f9788-2905-11e7-88dc-b44567b9cd34.png)](https://cloud.githubusercontent.com/assets/17261080/25340190/429f9788-2905-11e7-88dc-b44567b9cd34.png) 2. Wasserstein distance for two normal distributions μ1 = N(m1, C1) and μ2 = N(m2, C2): [![screen shot 2017-04-24 at 3 46 44 pm](https://cloud.githubusercontent.com/assets/17261080/25340191/42b23474-2905-11e7-9810-58d5326bf886.png)](https://cloud.githubusercontent.com/assets/17261080/25340191/42b23474-2905-11e7-9810-58d5326bf886.png) They also introduce an equilibrium concept to account for the situation when `G` and `D` are not well balanced and the discriminator `D` wins easily. This is controlled by what they call the diversity ratio, which balances between auto-encoding real images and discriminating real from generated images. It is defined as follows: [![screen shot 2017-04-24 at 3 56 29 pm](https://cloud.githubusercontent.com/assets/17261080/25340609/992c2188-2906-11e7-8c51-498bbd293119.png)](https://cloud.githubusercontent.com/assets/17261080/25340609/992c2188-2906-11e7-8c51-498bbd293119.png) To maintain this balance they use standard SGD but introduce a variable `kt`, initially 0, to control how much emphasis is put on the generator `G`. This removes the need to do `x` steps on `D` followed by `y` steps on `G`, or to pre-train one of the two. [![screen shot 2017-04-24 at 3 59 57 pm](https://cloud.githubusercontent.com/assets/17261080/25340859/4ee06476-2907-11e7-971f-90421449cb51.png)](https://cloud.githubusercontent.com/assets/17261080/25340859/4ee06476-2907-11e7-971f-90421449cb51.png) Finally they derive a global convergence measure by using the equilibrium concept that can be used to determine when the network has reached its final state or if the model has collapsed: [![screen shot 2017-04-24 at 4 04 12 pm](https://cloud.githubusercontent.com/assets/17261080/25340998/b8bf6ad6-2907-11e7-8afa-294cae32c6af.png)](https://cloud.githubusercontent.com/assets/17261080/25340998/b8bf6ad6-2907-11e7-8afa-294cae32c6af.png) ## Architecture: They tried to keep the architecture simple to really study the impact of their new equilibrium principle and loss. They don't use batch normalization, dropout, transpose convolutions or exponential growth for convolution filters. [![screen shot 2017-04-24 at 4 09 29 pm](https://cloud.githubusercontent.com/assets/17261080/25341219/6fb7be28-2908-11e7-8774-287c1b7d7684.png)](https://cloud.githubusercontent.com/assets/17261080/25341219/6fb7be28-2908-11e7-8774-287c1b7d7684.png) ## Results: They trained on images from 32x32 to 256x256, but at higher resolutions images tend to lose sharpness. Nevertheless the images are very good!
[![screen shot 2017-04-24 at 4 20 30 pm](https://cloud.githubusercontent.com/assets/17261080/25341699/f99b0770-2909-11e7-84a0-3ac0436771e5.png)](https://cloud.githubusercontent.com/assets/17261080/25341699/f99b0770-2909-11e7-84a0-3ac0436771e5.png) |
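A schematic sketch of the equilibrium mechanism described above, in plain Python: `ae_loss_real` and `ae_loss_fake` stand in for the pixel-wise autoencoder losses of real and generated batches, and the update rules transcribe the screenshots, so treat the exact constants as assumptions.

```python
def began_step(ae_loss_real, ae_loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN balancing step: returns the two losses, the updated k_t and
    the global convergence measure."""
    loss_D = ae_loss_real - k * ae_loss_fake          # discriminator objective
    loss_G = ae_loss_fake                             # generator objective
    k = k + lambda_k * (gamma * ae_loss_real - ae_loss_fake)
    k = min(max(k, 0.0), 1.0)                         # keep k_t in [0, 1]
    m_global = ae_loss_real + abs(gamma * ae_loss_real - ae_loss_fake)
    return loss_D, loss_G, k, m_global

print(began_step(ae_loss_real=0.30, ae_loss_fake=0.25, k=0.0))
```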
[link]
_Objective:_ Image-to-image translation to perform visual attribute transfer using unpaired images. _Dataset:_ [Cityscapes](https://www.cityscapes-dataset.com/), [CMP Facade](http://cmp.felk.cvut.cz/%7Etylecr1/facade/), [UT Zappos50k](http://vision.cs.utexas.edu/projects/finegrained/utzap50k/) and [ImageNet](http://www.image-net.org/). _Code:_ [CycleGAN](https://github.com/junyanz/CycleGAN) ## Inner-workings: Basically two GANs, one for each domain, with their respective generator and discriminator, plus two additional losses (called consistency losses) to make sure that translating to the other domain and then back yields an image that is still realistic. [![screen shot 2017-06-02 at 10 24 45 am](https://cloud.githubusercontent.com/assets/17261080/26717449/bcd8a9cc-477d-11e7-9137-fd277a0ec04f.png)](https://cloud.githubusercontent.com/assets/17261080/26717449/bcd8a9cc-477d-11e7-9137-fd277a0ec04f.png) For the consistency loss they use a pixel-wise L1 norm: [![screen shot 2017-06-02 at 10 31 22 am](https://cloud.githubusercontent.com/assets/17261080/26717733/bc088cdc-477e-11e7-96af-2defa06a1660.png)](https://cloud.githubusercontent.com/assets/17261080/26717733/bc088cdc-477e-11e7-96af-2defa06a1660.png) ## Architecture: Based on [Perceptual losses for real-time style transfer and super-resolution](https://arxiv.org/pdf/1603.08155.pdf), code available [here](https://github.com/jcjohnson/fast-neural-style). Training seems to employ several tricks and even uses a batch size of 1. ## Results: Very impressive, and the key point is that you don't need paired images, which makes this trainable on any pair of domains with the same underlying representation. [![screen shot 2017-06-02 at 10 26 29 am](https://cloud.githubusercontent.com/assets/17261080/26717502/f6d1fb7e-477d-11e7-8174-7bdd621cf1b6.png)](https://cloud.githubusercontent.com/assets/17261080/26717502/f6d1fb7e-477d-11e7-8174-7bdd621cf1b6.png) |
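A minimal NumPy sketch of the cycle-consistency term (pixel-wise L1): `G` maps domain X to Y and `F` maps Y back to X, both stubbed out here with identity functions; the weight `lam` is an assumed hyper-parameter.

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    # || F(G(x)) - x ||_1 + || G(F(y)) - y ||_1, averaged over pixels
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return lam * (forward + backward)

x = np.random.rand(3, 256, 256)
y = np.random.rand(3, 256, 256)
print(cycle_consistency_loss(lambda t: t, lambda t: t, x, y))  # identity generators -> 0.0
```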
[link]
_Objective:_ Theoretical study of Deep Neural Networks, their expressivity and regularization. ## Results: The key findings of the article are: ### A. Deep neural networks easily fit random labels. This holds when randomizing labels, when replacing images with raw noise, and in all situations in between. 1. The effective capacity of neural networks is sufficient for memorizing the entire data set. 2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels. 3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged. ### B. Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. By explicit regularization they mean batch normalization, weight decay, dropout, data augmentation, etc. ### C. Generically large neural networks can express any labeling of the training data. More formally, a very simple two-layer ReLU network with `p = 2n + d` parameters can express any labeling of any sample of size `n` in `d` dimensions. ### D. The optimization algorithm itself is implicitly regularizing the solution. SGD acts as an implicit regularizer and its properties are inherited by models that were trained using SGD. |
[link]
_Objective:_ Define a framework for Adversarial Domain Adaptation and propose a new architecture as state-of-the-art. _Dataset:_ MNIST, USPS, SVHN and NYUD. ## Inner workings: Subsumes previous work in a generalized framework where designing a new method is reduced to making three design choices: * whether to use a generative or discriminative base model. * whether to tie or untie the weights. * which adversarial learning objective to use. [![screen shot 2017-04-18 at 5 10 01 pm](https://cloud.githubusercontent.com/assets/17261080/25138167/15d5e644-245a-11e7-9fb8-636ce4111036.png)](https://cloud.githubusercontent.com/assets/17261080/25138167/15d5e644-245a-11e7-9fb8-636ce4111036.png) ## Architecture: [![screen shot 2017-04-18 at 5 14 44 pm](https://cloud.githubusercontent.com/assets/17261080/25138526/07848bd0-245b-11e7-94c9-f6ae7ccea76f.png)](https://cloud.githubusercontent.com/assets/17261080/25138526/07848bd0-245b-11e7-94c9-f6ae7ccea76f.png) ## Results: Interesting, as the theoretical framework seems to converge with other papers, and their architecture improves on previous papers' performance even if it's not a huge improvement. |
[link]
_Objective:_ Specifically adapt Active Learning to image classification with deep learning. _Dataset:_ [CARC](https://bcsiriuschen.github.io/CARC/) and [Caltech-256](http://authors.library.caltech.edu/7694/) ## Inner-workings: They obtain labels from two sources: * The most informative/uncertain samples are manually labeled using least confidence, margin sampling and entropy, see [Active Learning Literature Survey](https://github.com/Deepomatic/papers/issues/192). * The second source is samples with high prediction confidence, which are automatically labeled. They represent the majority of samples. ## Architecture: [![screen shot 2017-06-29 at 3 57 43 pm](https://user-images.githubusercontent.com/17261080/27691277-d4547196-5ce3-11e7-849c-aadd30d71d68.png)](https://user-images.githubusercontent.com/17261080/27691277-d4547196-5ce3-11e7-849c-aadd30d71d68.png) They proceed with the following steps: 1. Initialization: they manually annotate a given number of images for each class in order to pre-train the network. 2. Complementary sample selection: they fix the network, identify the most uncertain samples for manual annotation and automatically annotate the most certain ones if their entropy is lower than a given threshold. 3. CNN fine-tuning: they train the network using the whole pool of already labeled and pseudo-labeled data. Then they put all the automatically labeled images back into the unlabelled pool. 4. Threshold updating: as the network gets more and more confident, the threshold for auto-labelling is linearly decreased. The idea is that the network gets a more reliable representation and can be trusted more. ## Results: Roughly halves the number of annotations needed. ⚠️ I don't feel like this paper can be trusted 100% ⚠️ |
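A rough NumPy sketch of step 2 (complementary sample selection), assuming `probs` holds the softmax outputs over the unlabeled pool: the highest-entropy samples go to manual annotation, while samples whose entropy falls below the threshold are pseudo-labeled automatically.

```python
import numpy as np

def select_samples(probs, n_manual=100, delta=0.05):
    """probs: (n_samples, n_classes) softmax outputs on the unlabeled pool."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    manual_idx = np.argsort(entropy)[-n_manual:]   # most uncertain -> ask a human
    auto_idx = np.where(entropy < delta)[0]        # most certain -> pseudo-label
    pseudo_labels = probs[auto_idx].argmax(axis=1)
    return manual_idx, auto_idx, pseudo_labels

probs = np.random.dirichlet(np.ones(10), size=5000)   # fake predictions for the pool
manual_idx, auto_idx, pseudo_labels = select_samples(probs)
```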
[link]
_Objective:_ Robust unsupervised learning of a probability distribution using a new module called the `critic` and the `Earth-mover distance`. _Dataset:_ [LSUN-Bedrooms](http://lsun.cs.princeton.edu/2016/) ## Inner working: Basically train a `critic` until convergence to retrieve the Wasserstein-1 distance, see pseudo-algorithm below: [![screen shot 2017-05-03 at 5 05 09 pm](https://cloud.githubusercontent.com/assets/17261080/25667162/003c9330-3023-11e7-9081-c181011f4e6f.png)](https://cloud.githubusercontent.com/assets/17261080/25667162/003c9330-3023-11e7-9081-c181011f4e6f.png) ## Results: * Easier training: no need for batch normalization and no need to fine-tune generator/discriminator balance. * Less sensitivity to network architecture. * Very good proxy that correlates very well with sample quality. * Non-vanishing gradients. |
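A compact PyTorch sketch that follows the pseudo-algorithm above: the critic is trained `n_critic` times with weight clipping (to enforce the Lipschitz constraint), then the generator takes one step; the hyper-parameter values are the usual defaults and should be treated as assumptions.

```python
import torch

def wgan_step(critic, generator, real, opt_c, opt_g, z_dim=64, n_critic=5, clip=0.01):
    for _ in range(n_critic):
        z = torch.randn(real.size(0), z_dim)
        fake = generator(z).detach()
        # Critic maximizes E[D(real)] - E[D(fake)], an estimate of the Wasserstein-1 distance
        loss_c = -(critic(real).mean() - critic(fake).mean())
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():           # weight clipping
            p.data.clamp_(-clip, clip)
    z = torch.randn(real.size(0), z_dim)
    loss_g = -critic(generator(z)).mean()       # generator minimizes -E[D(fake)]
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return -loss_c.item()                       # proxy that correlates with sample quality
```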
[link]
_Objective:_ Predict labels using a very large dataset with noisy labels and a much smaller (3 orders of magnitude) dataset with human-verified annotations. _Dataset:_ [Open image](https://research.googleblog.com/2016/09/introducing-open-images-dataset.html) ## Architecture: Contrary to other approaches, they use the clean labels and the noisy labels but also image features. They basically train 3 networks: 1. A feature extractor for the image. 2. A label cleaning network that learns to predict verified labels from noisy labels + image features. 3. An image classifier that predicts using just the image. [![screen shot 2017-04-12 at 11 10 56 am](https://cloud.githubusercontent.com/assets/17261080/24950258/c4764106-1f70-11e7-82e4-c1111ffc089e.png)](https://cloud.githubusercontent.com/assets/17261080/24950258/c4764106-1f70-11e7-82e4-c1111ffc089e.png) ## Results: Overall better performance but not a breathtaking improvement: from `AP 83.832 / MAP 61.82` for a NN trained only on labels to `AP 87.67 / MAP 62.38` with their approach. |
[link]
_Objective:_ Find a generative model that avoids the usual shortcomings and produces: (i) high-resolution images, (ii) a variety of images and (iii) samples matching the dataset diversity. _Dataset:_ [ImageNet](https://www.image-net.org/) ## Inner-workings: The idea is to find an image that maximizes the probability for a given label by using a variant of a Markov Chain Monte Carlo (MCMC) sampler. [![screen shot 2017-06-01 at 12 31 14 pm](https://cloud.githubusercontent.com/assets/17261080/26675978/3c9e6d94-46c6-11e7-9f67-477c4036a891.png)](https://cloud.githubusercontent.com/assets/17261080/26675978/3c9e6d94-46c6-11e7-9f67-477c4036a891.png) Where the first term ensures that we stay in the image manifold that we're trying to find and don't just produce adversarial examples, and the second term makes sure that we find an image corresponding to the label we're looking for. Basically we start with a random image and iteratively find a better image to match the label we're trying to generate. ### MALA-approx: MALA-approx is the MCMC sampler based on the Metropolis-Adjusted Langevin Algorithm that they use in the paper; it is defined iteratively as follows: [![screen shot 2017-06-01 at 12 25 45 pm](https://cloud.githubusercontent.com/assets/17261080/26675866/bf15cc28-46c5-11e7-9620-659d26f84bf8.png)](https://cloud.githubusercontent.com/assets/17261080/26675866/bf15cc28-46c5-11e7-9620-659d26f84bf8.png) where: * epsilon1 makes the image more generic. * epsilon2 increases confidence in the chosen class. * epsilon3 adds noise to encourage diversity. ### Image prior: They try several priors for the images: 1. PPGN-x: p(x) is modeled with a Denoising Auto-Encoder (DAE). [![screen shot 2017-06-01 at 1 48 33 pm](https://cloud.githubusercontent.com/assets/17261080/26678501/1737c64e-46d1-11e7-82a4-7ee0aa8bfe2f.png)](https://cloud.githubusercontent.com/assets/17261080/26678501/1737c64e-46d1-11e7-82a4-7ee0aa8bfe2f.png) 2. DGN-AM: uses a latent space to model x with h using a GAN. [![screen shot 2017-06-01 at 1 49 41 pm](https://cloud.githubusercontent.com/assets/17261080/26678517/2e743194-46d1-11e7-95dc-9bb638128242.png)](https://cloud.githubusercontent.com/assets/17261080/26678517/2e743194-46d1-11e7-95dc-9bb638128242.png) 3. PPGN-h: incorporates a prior for p(h) using a DAE. [![screen shot 2017-06-01 at 1 51 14 pm](https://cloud.githubusercontent.com/assets/17261080/26678579/6bd8cb58-46d1-11e7-895d-f9432b7e5e1f.png)](https://cloud.githubusercontent.com/assets/17261080/26678579/6bd8cb58-46d1-11e7-895d-f9432b7e5e1f.png) 4. Joint PPGN-h: to increase the expressivity of G, h is modeled by first modeling x in the DAE. [![screen shot 2017-06-01 at 1 51 23 pm](https://cloud.githubusercontent.com/assets/17261080/26678622/a7bf2f68-46d1-11e7-9209-98f97e0a218d.png)](https://cloud.githubusercontent.com/assets/17261080/26678622/a7bf2f68-46d1-11e7-9209-98f97e0a218d.png) 5. Noiseless joint PPGN-h: same as the previous one but without noise. [![screen shot 2017-06-01 at 1 54 11 pm](https://cloud.githubusercontent.com/assets/17261080/26678655/d5499220-46d1-11e7-93d0-d48a6b6fa1a8.png)](https://cloud.githubusercontent.com/assets/17261080/26678655/d5499220-46d1-11e7-93d0-d48a6b6fa1a8.png) ### Conditioning: In the paper they mostly condition on labels, but captions or pretty much anything else can also be used.
[![screen shot 2017-06-01 at 2 26 53 pm](https://cloud.githubusercontent.com/assets/17261080/26679654/6297ab86-46d6-11e7-86fa-f763face01ca.png)](https://cloud.githubusercontent.com/assets/17261080/26679654/6297ab86-46d6-11e7-86fa-f763face01ca.png) ## Architecture: The final architecture using a pretrained classifier network is below. Note that only G and D are trained. [![screen shot 2017-06-01 at 2 29 49 pm](https://cloud.githubusercontent.com/assets/17261080/26679785/db143520-46d6-11e7-9668-72864f1a8eb1.png)](https://cloud.githubusercontent.com/assets/17261080/26679785/db143520-46d6-11e7-9668-72864f1a8eb1.png) ## Results: Pretty much any base network can be used with minimal training of G and D. It produces very realistic images with great diversity; see below for examples of 227x227 images with ImageNet. [![screen shot 2017-06-01 at 2 32 38 pm](https://cloud.githubusercontent.com/assets/17261080/26679884/4494002a-46d7-11e7-882e-c69aff2ddd17.png)](https://cloud.githubusercontent.com/assets/17261080/26679884/4494002a-46d7-11e7-882e-c69aff2ddd17.png) |
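A schematic NumPy sketch of the MALA-approx update described above: start from a random image and iteratively move it toward the target class; `grad_log_prior` and `grad_log_class` are placeholders for what the prior network (DAE/GAN) and the classifier would provide, and the epsilon values are illustrative only.

```python
import numpy as np

def mala_approx(grad_log_prior, grad_log_class, x0, n_steps=200,
                eps1=1e-5, eps2=1.0, eps3=1e-11):
    """eps1 keeps x on the image manifold (more generic), eps2 increases the
    confidence of the chosen class, eps3 adds noise to encourage diversity."""
    x = x0.copy()
    for _ in range(n_steps):
        x = (x
             + eps1 * grad_log_prior(x)
             + eps2 * grad_log_class(x)
             + np.random.normal(0.0, eps3, size=x.shape))
    return x

# Toy usage with dummy gradients pulling x toward zero.
x = mala_approx(lambda x: -x, lambda x: -x, np.random.rand(3, 227, 227))
```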
[link]
_Objective:_ Train on both classification and detection images to make a better, faster and stronger detector. _Dataset:_ [ImageNet](http://www.image-net.org/), [COCO](http://mscoco.org/) and [WordNet](https://wordnet.princeton.edu/). ## Architecture: Apart from improvements such as batch norm and other general tweaking, the real gains come from: 1. Using both a classification dataset and a detection dataset at the same time. 2. Replacing the usual final softmax layer (which assumes that all labels are mutually exclusive) with a WordTree label hierarchy based on WordNet, which enables the network to predict `dog` even if it doesn't know whether it's a `Fox Terrier`. [![screen shot 2017-04-12 at 7 24 28 pm](https://cloud.githubusercontent.com/assets/17261080/24970727/b7abaf02-1fb5-11e7-8b78-2a430a861cbd.png)](https://cloud.githubusercontent.com/assets/17261080/24970727/b7abaf02-1fb5-11e7-8b78-2a430a861cbd.png) ## Results: State-of-the-art results at full resolution and the possibility to trade accuracy for computation time. [![screen shot 2017-04-12 at 7 31 26 pm](https://cloud.githubusercontent.com/assets/17261080/24971010/a51556f8-1fb6-11e7-9289-fc277b182686.png)](https://cloud.githubusercontent.com/assets/17261080/24971010/a51556f8-1fb6-11e7-9289-fc277b182686.png) |
[link]
_Objective:_ Fundamental analysis of random networks using mean-field theory. Introduces two scales controlling network behavior. ## Results: A guide for choosing hyper-parameters so that random networks are nearly critical (in between order and chaos). This in turn implies that information can propagate forward and backward and thus the network is trainable (no vanishing or exploding gradients). Basically, for any given number of layers and initialization covariances for weights and biases, it tells you whether the network will be trainable or not; kind of an architecture validation tool. **To be noted:** any amount of dropout removes the critical point and therefore implies an upper bound on trainable network depth. ## Caveats: * Considers only bounded activation units: no ReLU, etc. * Applies directly only to fully connected feed-forward networks: no convnets, etc. |
[link]
_Objective:_ Compare several meta-architectures and hyper-parameters in the same framework for easy comparison. ## Architectures: Four meta-architectures: 1. R-CNN 2. Faster R-CNN 3. SSD 4. YOLO (not evaluated in the paper) [![screen shot 2017-05-05 at 3 12 57 pm](https://cloud.githubusercontent.com/assets/17261080/25746807/5a294360-31a5-11e7-808e-d48497a16cd5.png)](https://cloud.githubusercontent.com/assets/17261080/25746807/5a294360-31a5-11e7-808e-d48497a16cd5.png) ## Results: Very useful for knowing at first glance which framework to implement. |
[link]
_Objective:_ Design a network that will itself find the best architecture for a given task. _Dataset:_ [CIFAR10](https://www.cs.toronto.edu/%7Ekriz/cifar.html) and [PTB](https://catalog.ldc.upenn.edu/ldc99t42). ## Inner-workings: The meta-network (an RNN) generates a string specifying the child network parameters. Such a child network is then trained for 35-50 epochs and its accuracy is used as the reward to train the meta-network with Reinforcement Learning. The RNN first generates networks with few layers (6), then this number is increased as training progresses. ## Architecture: They develop one architecture for CNNs where they predict each layer's characteristics plus its possible skip-connections: [![screen shot 2017-05-24 at 8 13 01 am](https://cloud.githubusercontent.com/assets/17261080/26389176/d807de42-4058-11e7-942a-8a129558e126.png)](https://cloud.githubusercontent.com/assets/17261080/26389176/d807de42-4058-11e7-942a-8a129558e126.png) And one specifically for LSTM-style cells: [![screen shot 2017-05-24 at 8 13 26 am](https://cloud.githubusercontent.com/assets/17261080/26389190/e2bfd506-4058-11e7-9168-62abd040156e.png)](https://cloud.githubusercontent.com/assets/17261080/26389190/e2bfd506-4058-11e7-9168-62abd040156e.png) ## Distributed setting: Below is the distributed setting that they use, with parameter servers connected to replicas (GPUs) that train child networks. [![screen shot 2017-05-24 at 8 09 05 am](https://cloud.githubusercontent.com/assets/17261080/26389084/5e354456-4058-11e7-83a9-089cb2c115b7.png)](https://cloud.githubusercontent.com/assets/17261080/26389084/5e354456-4058-11e7-83a9-089cb2c115b7.png) ## Results: Overall they trained 12800 networks on 800 GPUs, but they achieve state-of-the-art results with no human intervention except the vocabulary selection (activation type, type of cells, etc.). Next step, transfer learning from one task to another for the meta-network? |
[link]
_Objective:_ Build a network easily trainable by back-propagation to perform unsupervised domain adaptation while at the same time learning a good embedding for both source and target domains. _Dataset:_ [SVHN](ufldl.stanford.edu/housenumbers/), [MNIST](yann.lecun.com/exdb/mnist/), [USPS](https://www.otexts.org/1577), [CIFAR](https://www.cs.toronto.edu/%7Ekriz/cifar.html) and [STL](https://cs.stanford.edu/%7Eacoates/stl10/). ## Architecture: Very similar to RevGrad but with some differences. Basically a shared encoder followed by a classifier and a reconstructor. [![screen shot 2017-05-22 at 6 11 22 pm](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png)](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png) The two losses are: * the usual cross-entropy with softmax for the classifier * the pixel-wise squared loss for reconstruction These are then combined using a trade-off hyper-parameter between classification and reconstruction. They also use data augmentation to generate additional training data during the supervised training using only geometrical deformations: translation, rotation, skewing, and scaling. Plus denoising to reconstruct clean inputs given their noisy counterparts (zero-masked noise and Gaussian noise). ## Results: Outperformed the state of the art on most tasks at the time; now itself outperformed by Generate To Adapt on most tasks. |
[link]
_Objective:_ Find a feature representation that does not allow discriminating between the training (source) and test (target) domains, using a discriminator trained directly on this embedding. _Dataset:_ MNIST, SYN Numbers, SVHN, SYN Signs, OFFICE, PRID, VIPeR and CUHK. ## Architecture: The basic idea behind this paper is to use a standard classifier network and choose one layer that will be the feature representation. The network before this layer is called the `Feature Extractor` and after it the `Label Predictor`. Then a new network called the `Domain Classifier` is introduced that takes as input the extracted features; its objective is to tell whether a computed feature embedding came from an image of the source or target dataset. During training the aim is to minimize the loss of the `Label Predictor` while maximizing the loss of the `Domain Classifier`. In theory we should end up with a feature embedding where the discriminator can't tell if the image came from the source or target domain, thus the domain shift should have been eliminated. To maximize the domain loss, a new layer is introduced, the `Gradient Reversal Layer`, which is the identity during the forward pass but reverses the gradient during back-propagation. This enables the network to be trained using simple gradient descent algorithms. What is interesting with this approach is that any initial network can be used by simply adding a small set of new layers for the domain classifier. Below is a generic architecture. [![screen shot 2017-04-18 at 1 59 53 pm](https://cloud.githubusercontent.com/assets/17261080/25129680/590f57ee-243f-11e7-8927-91124303b584.png)](https://cloud.githubusercontent.com/assets/17261080/25129680/590f57ee-243f-11e7-8927-91124303b584.png) ## Results: Their approach works, but for some domain adaptation tasks it completely fails and overall its performance is not great. Since then the state of the art has changed, see DANN combined with GAN or ADDA. |
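A minimal PyTorch sketch of the `Gradient Reversal Layer`: identity in the forward pass, gradient multiplied by -lambda in the backward pass, so the feature extractor is pushed to maximize the domain classifier's loss while everything trains with plain gradient descent.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity forward, -lambda * gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

features = torch.randn(8, 128, requires_grad=True)
reversed_features = GradientReversal.apply(features, 1.0)  # feed this to the domain classifier
reversed_features.sum().backward()
print(features.grad[0, 0])  # -1.0: the gradient has been reversed
```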
[link]
_Objective:_ Improve on Fast R-CNN and [SPPnet](https://arxiv.org/abs/1406.4729) by incorporating the region proposal network directly. _Dataset:_ [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) and [COCO](http://mscoco.org/). Both Fast R-CNN and SPPnet take as input an image and several possible objects (corresponding to regions of interest) and score each of them. There are thus two different entities: 1. A region proposal network. 2. A classification/detection network (Fast R-CNN/SPPnet). ## Architecture: First, image features are extracted using a state-of-the-art ConvNet, then they are used for both region proposal and the actual detection/classification on those regions. [![screen shot 2017-04-14 at 2 59 28 pm](https://cloud.githubusercontent.com/assets/17261080/25043807/01a287b6-2123-11e7-944c-01493371df29.png)](https://cloud.githubusercontent.com/assets/17261080/25043807/01a287b6-2123-11e7-944c-01493371df29.png) ## Results: By incorporating the region proposal network right after the feature ConvNet, its computation cost becomes basically free, which leads to an elegant solution (only one network) and, more importantly, greatly improves speed at test time. |
[link]
_Objective:_ Solve the degradation problem where adding layers induces a higher training error. _Dataset:_ [CIFAR10](https://www.cs.toronto.edu/%7Ekriz/cifar.html), [PASCAL](http://host.robots.ox.ac.uk/pascal/VOC/) and [COCO](http://mscoco.org/). ## Inner-workings: They argue that it is easier to learn the difference from the identity (the residual) than the actual mapping. Basically they start with the identity and learn the residual mapping. This allows for easier training and thus deeper networks. ## Architecture: They introduce two new building blocks for Residual Networks, depending on the input dimensionality: [![screen shot 2017-05-31 at 3 49 59 pm](https://cloud.githubusercontent.com/assets/17261080/26635061/d489dbe2-4618-11e7-911e-68772265ee9f.png)](https://cloud.githubusercontent.com/assets/17261080/26635061/d489dbe2-4618-11e7-911e-68772265ee9f.png) [![screen shot 2017-05-31 at 3 57 47 pm](https://cloud.githubusercontent.com/assets/17261080/26635420/f6f22af8-4619-11e7-9639-ed651f8b18bb.png)](https://cloud.githubusercontent.com/assets/17261080/26635420/f6f22af8-4619-11e7-9639-ed651f8b18bb.png) These can then be chained to produce networks such as: [![screen shot 2017-05-31 at 3 54 16 pm](https://cloud.githubusercontent.com/assets/17261080/26635258/7b64530c-4619-11e7-81c8-5d6be547da77.png)](https://cloud.githubusercontent.com/assets/17261080/26635258/7b64530c-4619-11e7-81c8-5d6be547da77.png) ## Results: Won most first places, very impressive, and adding layers does increase accuracy. |
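A minimal PyTorch sketch of the identity-shortcut building block (the first screenshot); the projection variant used when dimensions change is omitted.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block: output = ReLU(F(x) + x), so the layers only learn the residual."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(residual + x)   # identity shortcut

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))  # same shape in, same shape out
```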
[link]
_Objective:_ Propose a more stable set of architectures for training GAN and show that they learn good representations of images for supervised learning and generative modeling. _Dataset:_ [LSUN](http://www.yf.io/p/lsun) and [ImageNet 1k](www.image-net.org/). ## Architecture: Below are the guidelines for making DCGANs. [![screen shot 2017-04-24 at 10 58 17 am](https://cloud.githubusercontent.com/assets/17261080/25329644/f3885f7c-28dc-11e7-8895-051124c8ff6c.png)](https://cloud.githubusercontent.com/assets/17261080/25329644/f3885f7c-28dc-11e7-8895-051124c8ff6c.png) And here is a sample network: [![screen shot 2017-04-24 at 10 57 54 am](https://cloud.githubusercontent.com/assets/17261080/25329634/e9c14abc-28dc-11e7-8bed-068f7f7bc78d.png)](https://cloud.githubusercontent.com/assets/17261080/25329634/e9c14abc-28dc-11e7-8bed-068f7f7bc78d.png) A tensorflow implementation can be found [here](https://github.com/carpedm20/DCGAN-tensorflow) along with an [online demo](https://carpedm20.github.io/faces/). ## Results: Quite interesting especially concerning the structure learned in the Z-space and how this can be used for interpolation or object removal, see the example that is shown everywhere: [![screen shot 2017-04-24 at 11 20 03 am](https://cloud.githubusercontent.com/assets/17261080/25330458/080b6b4e-28e0-11e7-9ab6-ce58ef5b5562.png)](https://cloud.githubusercontent.com/assets/17261080/25330458/080b6b4e-28e0-11e7-9ab6-ce58ef5b5562.png) Nonetheless the network is still generating small images (32x32). |
[link]
Automatically learn which Active Learning strategy to use. _Code:_ [here](https://github.com/ntucllab/libact) ## Inner-workings: They use the multi-armed bandit framework where each arm is an Active Learning strategy. The core RL algorithm used is [EXP4.P](https://arxiv.org/abs/1002.4058), which is itself based on EXP4 (**Exp**onential weighting for **Exp**loration and **Exp**loitation with **Exp**erts). They make only slight adjustments to the reward function. ## Algorithm: [![screen shot 2017-06-14 at 7 33 46 pm](https://user-images.githubusercontent.com/17261080/27146101-6d8392b4-5138-11e7-8e12-5617b258ddfa.png)](https://user-images.githubusercontent.com/17261080/27146101-6d8392b4-5138-11e7-8e12-5617b258ddfa.png) ## Results: Beats all other techniques most of the time and makes sure that in the long run we use the best strategy. |
[link]
Improve on [R-CNN](https://arxiv.org/abs/1311.2524) and [SPPnet](https://arxiv.org/abs/1406.4729) with easier and faster training. A Region-based Convolutional Neural Network (R-CNN) basically takes as input an image and several possible objects (corresponding to Regions of Interest) and scores each of them. ## Architecture: The feature map is computed for the whole image and then for each region of interest a new fixed-length feature vector is computed using max-pooling. From it, two predictions are made: classification and bounding-box offsets. [![screen shot 2017-04-14 at 12 46 38 pm](https://cloud.githubusercontent.com/assets/17261080/25041460/6e7cba40-2110-11e7-8650-faae2a6b0a92.png)](https://cloud.githubusercontent.com/assets/17261080/25041460/6e7cba40-2110-11e7-8650-faae2a6b0a92.png) ## Results: By sharing computation for RoIs of the same image and allowing simple SGD training it really improves training performance, although at test time it's still not as fast as YOLO9000. |
[link]
Network training is very sensitive to the learning rate and initialization factors. Each layer's output distribution is different from its input distribution (called covariate shift), which implies that layers have to permanently adapt to a new input distribution. In this paper the authors introduce batch normalization, a new layer to reduce covariate shift. _Dataset:_ [MNIST](http://yann.lecun.com/exdb/mnist/), [ImageNet](www.image-net.org/). #### Inner workings: Batch normalization fixes the means and variances of layer inputs for a training batch by computing the following normalization on each batch. [![screen shot 2017-04-13 at 10 21 39 am](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png)](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png) The parameters Gamma and Beta are then learned with gradient descent. During inference the statistics are computed using unbiased estimators of the whole dataset (and not just the batch). #### Results: Batch normalization provides several advantages: 1. Use of a higher learning rate without risk of divergence by stabilizing the gradient scale. 2. Regularizes the model. 3. Reduces the need for dropout. 4. Avoids the network getting stuck when using saturating nonlinearities. #### What to do? 1. Add batch norm layers before activation layers. 2. Increase the learning rate. 3. Remove dropout. 4. Reduce L2 weight regularization. 5. Accelerate learning rate decay. 6. Reduce picture distortion for data augmentation. |
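A small NumPy sketch of the training-time batch normalization transform for a (batch, features) input, matching the screenshot above: normalize with the batch statistics, then scale and shift with the learned parameters gamma and beta.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift (learned parameters)

x = np.random.randn(32, 100) * 3.0 + 5.0    # a batch with shifted, scaled features
out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.var())                # close to 0 and 1
```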
[link]
_Objective:_ Design a loss to make deep networks robust to label noise. _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/), Toronto Faces Database, [ILSVRC2014](http://www.image-net.org/challenges/LSVRC/2014/). #### Inner-workings: Three types of losses are presented: * reconstruction loss: [![screen shot 2017-06-26 at 11 00 07 am](https://user-images.githubusercontent.com/17261080/27532200-bb42b8a6-5a5f-11e7-8c14-673958216bfc.png)](https://user-images.githubusercontent.com/17261080/27532200-bb42b8a6-5a5f-11e7-8c14-673958216bfc.png) * soft bootstrapping, which uses the labels predicted by the network `qk` and the user-provided labels `tk`: [![screen shot 2017-06-26 at 11 10 43 am](https://user-images.githubusercontent.com/17261080/27532296-1e01a420-5a60-11e7-9273-d1affb0d7c2e.png)](https://user-images.githubusercontent.com/17261080/27532296-1e01a420-5a60-11e7-9273-d1affb0d7c2e.png) * hard bootstrapping, which replaces the soft predicted labels by their binary version: [![screen shot 2017-06-26 at 11 12 58 am](https://user-images.githubusercontent.com/17261080/27532439-a3f9dbd8-5a60-11e7-91a7-327efc748eae.png)](https://user-images.githubusercontent.com/17261080/27532439-a3f9dbd8-5a60-11e7-91a7-327efc748eae.png) [![screen shot 2017-06-26 at 11 13 05 am](https://user-images.githubusercontent.com/17261080/27532463-b52f4ab4-5a60-11e7-9aed-615109b61bd8.png)](https://user-images.githubusercontent.com/17261080/27532463-b52f4ab4-5a60-11e7-9aed-615109b61bd8.png) #### Architecture: They test with Feed Forward Neural Networks only. #### Results: They use only permutation noise, with a very high probability compared to what we might encounter in real life. [![screen shot 2017-06-26 at 11 29 05 am](https://user-images.githubusercontent.com/17261080/27533105-b051d366-5a62-11e7-95f3-168d0d2d7841.png)](https://user-images.githubusercontent.com/17261080/27533105-b051d366-5a62-11e7-95f3-168d0d2d7841.png) The improvement for small noise probabilities (<10%) might not be that interesting. |
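A NumPy sketch of the two bootstrapping losses described above, assuming `q` holds softmax predictions and `t` the (possibly noisy) one-hot targets; the beta values are illustrative, not taken from the paper's experiments.

```python
import numpy as np

def soft_bootstrap_loss(q, t, beta=0.95):
    """Target is a convex mix of the given labels and the model's own soft predictions."""
    return -np.sum((beta * t + (1 - beta) * q) * np.log(q + 1e-12), axis=1).mean()

def hard_bootstrap_loss(q, t, beta=0.8):
    """Same idea, but the model's predictions are first binarized (argmax)."""
    z = np.eye(q.shape[1])[q.argmax(axis=1)]
    return -np.sum((beta * t + (1 - beta) * z) * np.log(q + 1e-12), axis=1).mean()

q = np.random.dirichlet(np.ones(10), size=64)   # fake predictions
t = np.eye(10)[np.random.randint(0, 10, 64)]    # fake (noisy) labels
print(soft_bootstrap_loss(q, t), hard_bootstrap_loss(q, t))
```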
[link]
_Objective:_ Build a network easily trainable by back-propagation to perform unsupervised domain adaptation while at the same time learning a good embedding for both source and target domains. _Dataset:_ [SVHN](ufldl.stanford.edu/housenumbers/), [MNIST](yann.lecun.com/exdb/mnist/), [USPS](https://www.otexts.org/1577), [CIFAR](https://www.cs.toronto.edu/%7Ekriz/cifar.html) and [STL](https://cs.stanford.edu/%7Eacoates/stl10/). #### Architecture: Very similar to RevGrad but with some differences. Basically a shared encoder followed by a classifier and a reconstructor. [![screen shot 2017-05-22 at 6 11 22 pm](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png)](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png) The two losses are: * the usual cross-entropy with softmax for the classifier * the pixel-wise squared loss for reconstruction These are then combined using a trade-off hyper-parameter between classification and reconstruction. They also use data augmentation to generate additional training data during the supervised training using only geometrical deformations: translation, rotation, skewing, and scaling. Plus denoising to reconstruct clean inputs given their noisy counterparts (zero-masked noise and Gaussian noise). #### Results: Outperformed the state of the art on most tasks at the time; now itself outperformed by Generate To Adapt on most tasks. |
[link]
_Objective:_ In an unconditional GAN it's not possible to control the mode of the data being generated which is what this paper tries to accomplish using the label data (but it can be generalized to any kind of conditional data). _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/) and [MIRFLICKR](http://press.liacs.nl/mirflickr/). #### Inner workings: Changes the loss to the conditional loss: [![screen shot 2017-04-24 at 10 07 25 am](https://cloud.githubusercontent.com/assets/17261080/25327832/e86f53fe-28d5-11e7-8694-6df8f2e1ef18.png)](https://cloud.githubusercontent.com/assets/17261080/25327832/e86f53fe-28d5-11e7-8694-6df8f2e1ef18.png) For implementation the only thing needed is to feed the label data to both the discriminator and generator: [![screen shot 2017-04-24 at 10 07 18 am](https://cloud.githubusercontent.com/assets/17261080/25327826/e53ab4a8-28d5-11e7-8056-1518602d50c9.png)](https://cloud.githubusercontent.com/assets/17261080/25327826/e53ab4a8-28d5-11e7-8056-1518602d50c9.png) #### Results: Interesting at the time but not surprising now. There's not much more to the paper than what is in the summary. |
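One common way to implement the conditioning described above (a sketch, not the paper's exact code) is to concatenate a one-hot encoding of the label to the inputs of both the generator and the discriminator:

```python
import torch

def condition(inputs, labels, n_classes=10):
    """Append a one-hot label encoding to a batch of input vectors."""
    one_hot = torch.nn.functional.one_hot(labels, n_classes).float()
    return torch.cat([inputs, one_hot], dim=1)

z = torch.randn(16, 100)                 # noise vectors for the generator
y = torch.randint(0, 10, (16,))          # class labels to condition on
generator_input = condition(z, y)        # shape (16, 110), fed to G; same trick for D
```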
[link]
_Objective:_ Transfer features learned from a large-scale dataset to a small-scale dataset. _Dataset:_ [ImageNet](www.image-net.org), [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/). #### Inner-workings: Basically they train the network on the large dataset, then replace the last layers, sometimes adding a new one, and train this on the new dataset. Pretty standard transfer learning nowadays. [![screen shot 2017-06-14 at 3 06 37 pm](https://user-images.githubusercontent.com/17261080/27133634-2d4c0fde-5113-11e7-848a-719514b1a12c.png)](https://user-images.githubusercontent.com/17261080/27133634-2d4c0fde-5113-11e7-848a-719514b1a12c.png) What's a bit more interesting is how they deal with the background being overrepresented by using the bounding boxes that they have. [![screen shot 2017-06-14 at 3 06 43 pm](https://user-images.githubusercontent.com/17261080/27133641-34d4ee7e-5113-11e7-8307-f1ff708bd5c7.png)](https://user-images.githubusercontent.com/17261080/27133641-34d4ee7e-5113-11e7-8307-f1ff708bd5c7.png) #### Results: A bit dated and not really applicable, but the part on specifically tackling the domain shift (such as background) is interesting. Plus they use the bounding-box information to refine the dataset. |
[link]
_Objective:_ Introduces the Convolutional Auto-Encoder, a hierarchical unsupervised feature extractor. _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/) and [SVHN](ufldl.stanford.edu/housenumbers/). #### Architecture: Uses convolutions to generate an encoding of the image, then decodes it and does a pixel-wise comparison. Used to initialize CNNs. #### Results: Old article, not really relevant nowadays. They don't speak about the deconvolution part. |
[link]
_Objective:_ Define a new deconvolution layer. #### Results: Not really interesting except for the fact that it first introduces **deconvolution layers**, which are ill-named as they are not actual deconvolutions but rather **transposed convolutions**, also called **fractionally strided convolutions**. [![Deconvolutional layer](https://cloud.githubusercontent.com/assets/17261080/25344392/44693b48-2912-11e7-8dda-2b64d99292a9.gif)](https://cloud.githubusercontent.com/assets/17261080/25344392/44693b48-2912-11e7-8dda-2b64d99292a9.gif) Visualizations of other operations can be seen [here](https://github.com/vdumoulin/conv_arithmetic), corresponding to [A guide to convolution arithmetic for deep learning](https://arxiv.org/pdf/1603.07285.pdf). |
[link]
Very good introduction to active learning. #### Scenarios There are three main scenarios: * Pool-based: a large amount of unlabeled data is available and we need to choose which instance to annotate next. * Stream-based: same as above except examples come one after the other. * Membership query synthesis: we can generate the point to label. #### Query Strategy Frameworks 2.1. Uncertainty Sampling Basically how to evaluate the informativeness of unlabeled instances and then select the most informative. 2.1.1. Least Confident Query the instances about which the algorithm is least certain how to label. [![screen shot 2017-06-14 at 5 08 37 pm](https://user-images.githubusercontent.com/17261080/27139765-281f1374-5124-11e7-9418-fb458be0bfc3.png)](https://user-images.githubusercontent.com/17261080/27139765-281f1374-5124-11e7-9418-fb458be0bfc3.png) [![screen shot 2017-06-14 at 5 09 36 pm](https://user-images.githubusercontent.com/17261080/27139841-5636458e-5124-11e7-95c4-ea586deb853a.png)](https://user-images.githubusercontent.com/17261080/27139841-5636458e-5124-11e7-95c4-ea586deb853a.png) Most used, but discards information on all other labels. 2.1.2. Margin Sampling Use the two most likely labels and choose the instance for which the difference between the two is the smallest. [![screen shot 2017-06-14 at 5 12 29 pm](https://user-images.githubusercontent.com/17261080/27139968-aabebe6a-5124-11e7-879b-f518e2279eba.png)](https://user-images.githubusercontent.com/17261080/27139968-aabebe6a-5124-11e7-879b-f518e2279eba.png) 2.1.3. Entropy Instead of using only the first two labels, why not use all of them? [![screen shot 2017-06-14 at 5 13 44 pm](https://user-images.githubusercontent.com/17261080/27140049-e33ea25a-5124-11e7-84ea-adab87d29174.png)](https://user-images.githubusercontent.com/17261080/27140049-e33ea25a-5124-11e7-84ea-adab87d29174.png) #### Query-By-Committee A committee of different models is trained. They then vote on which instance to label and the one on which they most disagree is chosen. To measure the level of disagreement, one can either use: * Vote entropy: [![screen shot 2017-06-14 at 5 20 26 pm](https://user-images.githubusercontent.com/17261080/27140436-d12d330a-5125-11e7-8f40-7be3bbc83987.png)](https://user-images.githubusercontent.com/17261080/27140436-d12d330a-5125-11e7-8f40-7be3bbc83987.png) * Kullback-Leibler divergence: [![screen shot 2017-06-14 at 5 21 32 pm](https://user-images.githubusercontent.com/17261080/27140492-f45be722-5125-11e7-9b42-204aaf4bdd92.png)](https://user-images.githubusercontent.com/17261080/27140492-f45be722-5125-11e7-9b42-204aaf4bdd92.png) [![screen shot 2017-06-14 at 5 22 29 pm](https://user-images.githubusercontent.com/17261080/27140537-12289cd2-5126-11e7-8e1d-62158576cd95.png)](https://user-images.githubusercontent.com/17261080/27140537-12289cd2-5126-11e7-8e1d-62158576cd95.png) #### Expected Model Change Selects the instance that would impart the greatest change to the current model if we knew its label. * Expected Gradient Length: compute the gradient for all instances and find the one with the largest magnitude on average over all labels. [![screen shot 2017-06-14 at 5 25 20 pm](https://user-images.githubusercontent.com/17261080/27140694-79cc6e4a-5126-11e7-9314-e837a1e0eba2.png)](https://user-images.githubusercontent.com/17261080/27140694-79cc6e4a-5126-11e7-9314-e837a1e0eba2.png) #### Expected Error Reduction Measures not how much the model is likely to change, but how much its generalization error is likely to be reduced.
Either by measuring: * Expected 0/1 loss: to reduce the expected total number of incorrect predictions. A new model needs to be trained for every label and instance, very greedy. [![screen shot 2017-06-14 at 5 28 42 pm](https://user-images.githubusercontent.com/17261080/27140912-08d7410a-5127-11e7-9d53-33f2044692a2.png)](https://user-images.githubusercontent.com/17261080/27140912-08d7410a-5127-11e7-9d53-33f2044692a2.png) * Expected Log-Loss: maximizing the expected information gain of the query. Still very greedy in computation! Not really usable unless the model can be updated analytically instead of re-trained. [![screen shot 2017-06-14 at 5 30 42 pm](https://user-images.githubusercontent.com/17261080/27140970-3e117516-5127-11e7-9936-671fea5d94dd.png)](https://user-images.githubusercontent.com/17261080/27140970-3e117516-5127-11e7-9936-671fea5d94dd.png) #### Variance Reduction Reduce the generalization error indirectly by minimizing the output variance. [![screen shot 2017-06-14 at 5 38 17 pm](https://user-images.githubusercontent.com/17261080/27141417-6507b71a-5128-11e7-81ca-ab227836098f.png)](https://user-images.githubusercontent.com/17261080/27141417-6507b71a-5128-11e7-81ca-ab227836098f.png) #### Density-Weighted Methods [![screen shot 2017-06-14 at 5 40 53 pm](https://user-images.githubusercontent.com/17261080/27141501-a920bd34-5128-11e7-8e9d-0870da365633.png)](https://user-images.githubusercontent.com/17261080/27141501-a920bd34-5128-11e7-8e9d-0870da365633.png) Where the left term is the informativeness of x and the right term represents the average similarity to all other instances in the input distribution. |
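A small NumPy sketch of the three uncertainty-sampling criteria from section 2.1, each returning the index of the instance to query next given the model's softmax outputs over the unlabeled pool.

```python
import numpy as np

def least_confident(probs):
    return int(np.argmax(1.0 - probs.max(axis=1)))       # most uncertain top prediction

def margin_sampling(probs):
    part = np.sort(probs, axis=1)
    return int(np.argmin(part[:, -1] - part[:, -2]))     # smallest top-2 margin

def entropy_sampling(probs):
    h = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(h))                             # highest predictive entropy

probs = np.random.dirichlet(np.ones(5), size=1000)       # fake softmax outputs for a pool
print(least_confident(probs), margin_sampling(probs), entropy_sampling(probs))
```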