Summary by Martin Thoma 6 years ago
This paper is about transfer learning for computer vision tasks.
* Before this paper, people focused on similar datasets (e.g. ImageNet-like images) or even the same dataset but a different task (classification -> segmentation). In this paper, the authors look at extremely different datasets (ImageNet-like vs. text) but only one task (classification). They show that all layers can be shared (including the last classification layer) between datasets such as MNIST and CIFAR-10.
* Normalization is necessary when sharing a model between datasets, in order to compensate for dataset-specific differences. Domain-specific scaling parameters work well.
* Used datasets:
1. MNIST (10 classes: handwritten digits 0-9),
2. SVHN (10 classes: house number digits, 0-9),
3. [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) (10 classes: airplane, automobile, bird, ...)
4. Daimler Mono Pedestrian Classification Benchmark (18 × 36 pixels)
5. Human Sketch dataset (20000 human sketches of everyday objects such as “book”, “car”, “house”, “sun”)
6. German Traffic Sign Recognition (GTSR) Benchmark (43 traffic signs)
7. Plankton imagery data (classification benchmark that contains 30336 images of various organisms ranging from the smallest single-celled protists to copepods, larval fish, and larger jellies)
8. Animals with Attributes (AwA): 30475 images of 50 animal species (for zero-shot learning)
9. Caltech-256: object classification benchmark (256 object categories and an additional background class)
10. Omniglot: 1623 different handwritten characters from 50 different alphabets (one shot learning)
* Images are resized to 64 × 64 pixels; greyscale images are converted to RGB by setting all three channels to the same value
* Each dataset is also whitened by subtracting its per-channel mean and dividing by its per-channel standard deviation
* **Architecture**: ResNet + Global Average Pooling + FC with Softmax
* "As the majority of the datasets have a different number of classes, we use a dataset-specific fully connected layer in our experiments unless otherwise stated."
* **Data augmentation**: Same strategy as in [He et al.](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15): the whitened 64 × 64 image is padded with 8 pixels on all sides, and a 64 × 64 patch is randomly sampled from the padded image or its horizontal flip (no horizontal flips for MNIST / Omniglot / SVHN, as those contain text)
* **Training**: stochastic gradient descent with momentum
1. Baseline: Train networks for each dataset independently
2. Full sharing: For MNIST / SVHN / CIFAR-10, group classes randomly together so that Node 2 might be digit "7" for MNIST, digit "3" for SVHN and "aeroplane" for CIFAR-10. They are trained together in one network.
3. Deep sharing: Share all layers except the last one. Use all 10 datasets for this.
4. Partial sharing: Have a dataset-specific first part to compensate for different image statistics, but share the middle of the network.
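
To make the "deep sharing" setup with domain-specific scaling more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: a single linear layer stands in for the shared ResNet trunk, and the names `shared_w`, `domain_params` and `forward` as well as `FEATURE_DIM` are made up for illustration. Only the class counts come from the dataset list above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FEATURE_DIM = 32          # width of the shared representation
INPUT_DIM = 3 * 64 * 64   # flattened 64 x 64 RGB image
NUM_CLASSES = {"mnist": 10, "gtsr": 43, "caltech256": 257}

# Shared by all datasets (a stand-in for the shared ResNet layers).
shared_w = rng.normal(0.0, 0.01, size=(INPUT_DIM, FEATURE_DIM))

# Dataset-specific parameters: one scale and shift per feature,
# plus a dataset-specific fully connected classifier.
domain_params = {
    name: {
        "gamma": np.ones(FEATURE_DIM),
        "beta": np.zeros(FEATURE_DIM),
        "fc": rng.normal(0.0, 0.01, size=(FEATURE_DIM, n_classes)),
    }
    for name, n_classes in NUM_CLASSES.items()
}


def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(images, dataset):
    """Return class probabilities for a batch of images from one dataset."""
    p = domain_params[dataset]
    x = images.reshape(len(images), -1)
    h = x @ shared_w                  # shared computation
    h = p["gamma"] * h + p["beta"]    # domain-specific scaling
    h = np.maximum(h, 0.0)            # ReLU
    return softmax(h @ p["fc"])       # dataset-specific classifier
```

Calling `forward(batch, "gtsr")` on a batch of shape `(N, 3, 64, 64)` gives an `(N, 43)` array of probabilities. During training, gradients from every dataset would update `shared_w`, while each `gamma`, `beta` and `fc` is only updated by batches from its own dataset.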
The results seem to be inconclusive to me.
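
The preprocessing and augmentation steps from the list above can be sketched roughly as follows. This is a sketch under assumptions: images are assumed to be already resized to 64 × 64, padding is done with zeros (the summary does not say how the border is filled), and the function names and the small `eps` guard are my own.

```python
import numpy as np


def grey_to_rgb(image):
    """Copy a (H, W) greyscale image into three identical channels."""
    return np.repeat(image[:, :, None], 3, axis=2)


def whiten(images, eps=1e-8):
    """Whiten a whole dataset of shape (N, H, W, 3) per channel."""
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    # eps only guards against division by zero for constant channels.
    return (images - mean) / (std + eps)


def augment(image, rng, pad=8, allow_flip=True):
    """Pad, take a random crop of the original size, and maybe flip."""
    h, w, _ = image.shape
    # Zero padding is an assumption, see the note above.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if allow_flip and rng.random() < 0.5:
        crop = crop[:, ::-1]  # horizontal flip
    return crop
```

For the datasets that contain text (MNIST / Omniglot / SVHN), `augment` would be called with `allow_flip=False`.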
## Follow-up / related work