The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image at a time. The authors report that this approach produces generated images that are hard to distinguish from the training data by eye.
#### What is DRAW:
The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region written by the decoder.
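The iterative generation loop can be sketched as follows. This is a minimal numpy sketch with toy sizes; `decoder_step` is a placeholder for the recurrent decoder plus attentive writer, which the paper implements with LSTMs and a learned attention window.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def draw_generate(decoder_step, T, canvas_shape, z_dim, hidden_dim, rng):
    # c_0 = 0: the canvas starts blank and is refined over T steps.
    canvas = np.zeros(canvas_shape)
    h = np.zeros(hidden_dim)                # recurrent decoder state
    for _ in range(T):
        z = rng.standard_normal(z_dim)      # z_t ~ N(0, I) at generation time
        patch, h = decoder_step(z, h)       # attentive "write" for this step
        canvas = canvas + patch             # c_t = c_{t-1} + write(h_t)
    return sigmoid(canvas)                  # final image: sigma(c_T)

# Toy usage: a "decoder" that writes a constant patch each step.
rng = np.random.default_rng(0)
def toy_step(z, h):
    return np.full((4, 4), 0.1), h

img = draw_generate(toy_step, T=5, canvas_shape=(4, 4),
                    z_dim=2, hidden_dim=3, rng=rng)
```

The key point the sketch captures is that each step only *adds* to the canvas, so early steps can lay down a rough image that later steps refine locally.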
#### What do we gain?
The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem.
#### What follows?
A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since the attention mechanism already restricts the input of the network.
* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.
* I think a better exposition of the attention mechanism would improve this paper.
This paper models object detection as a regression problem for bounding
boxes and object class probabilities with a single pass through the CNN. The
main contribution is the idea of dividing the image into a 7x7 grid, and having
each cell predict a distribution over class labels as well as a bounding box
for the object whose center falls into it. It's much faster than R-CNN and
Fast R-CNN, as the additional step of extracting region proposals has been removed.
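The per-cell decoding described above can be sketched as follows. This is a simplified numpy sketch assuming one box per cell; the actual model predicts B=2 boxes per cell, uses 20 VOC classes, and applies non-max suppression afterwards.

```python
import numpy as np

def decode_yolo_grid(pred, conf_thresh=0.5, S=7, img_size=448):
    """Decode a simplified YOLO-style output tensor.

    pred: array of shape (S, S, 5 + C), where the last axis holds
    (x, y, w, h, confidence, class scores...). x, y are offsets within
    the cell; w, h are relative to the whole image.
    """
    boxes = []
    cell = img_size / S
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf = pred[row, col, :5]
            if conf < conf_thresh:
                continue
            cx = (col + x) * cell            # box center in pixels
            cy = (row + y) * cell
            cls = int(np.argmax(pred[row, col, 5:]))
            boxes.append((cx, cy, w * img_size, h * img_size, conf, cls))
    return boxes

# Toy usage: one confident detection centered in cell (3, 3).
pred = np.zeros((7, 7, 25))
pred[3, 3, :5] = [0.5, 0.5, 0.1, 0.2, 0.9]
pred[3, 3, 5 + 11] = 1.0
boxes = decode_yolo_grid(pred)
```

Because every cell is decoded in one pass over a single tensor, detection becomes a single forward propagation through the CNN, which is where the speed advantage comes from.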
## Strengths
- Works in real time: the base model runs at 45 fps and a faster version at up to
150 fps, which the authors claim is more than twice as fast as other real-time
detection works.
- End-to-end model: localization and classification errors can be jointly optimized.
- YOLO makes more localization errors but fewer background mistakes than
Fast R-CNN, so using YOLO to eliminate false background detections from
Fast R-CNN results in a ~3% mAP gain (with little added computation, since
Fast R-CNN is much slower than YOLO anyway).
## Weaknesses / Notes
- Results fall short of state-of-the-art: 57.9% vs. 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid
cell can be detected.
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve accuracy comparable to their fully trained counterparts if you rewind the remaining weights to their values from early in training and retrain, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small, effective subnetwork (a "winning ticket") that you can find if you search enough.
Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that:
1. Weight rewinding, where you rewind weights to *near* their starting values and then retrain using the learning rates from early in training, outperforms fine-tuning from the values the weights had when you pruned
2. Learning rate rewinding, where you keep the weights as they are but rewind the learning rate to what it was early in training, is actually the most effective for a given amount of training time/search cost
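The two compared schemes can be sketched as follows. This is a toy numpy sketch: `grads_fn` and the schedules stand in for the real network's gradients and the paper's actual learning-rate schedules, and pruning here is simple global magnitude pruning.

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    # Standard practice: prune the lowest-magnitude fraction of weights.
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

def retrain(weights, mask, grads_fn, lr_schedule):
    # Fine-tuning passes the final (small) learning rate here; learning-rate
    # rewinding passes the early, larger rates. Only lr_schedule differs.
    w = weights * mask
    for lr in lr_schedule:
        w = w - lr * grads_fn(w) * mask   # pruned weights stay at zero
    return w

# Toy usage: prune half the weights, then take two masked SGD steps
# on the toy loss 0.5 * ||w||^2 (whose gradient is w itself).
w = np.array([0.1, -0.2, 0.3, -0.4])
mask = magnitude_prune_mask(w, 0.5)
out = retrain(w, mask, lambda w: w, lr_schedule=[0.1, 0.1])
```

The point of the sketch is that weight rewinding changes the starting `weights` while learning-rate rewinding changes only `lr_schedule`; the paper's finding is that the latter knob is the one that matters most.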
To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed weights. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to old learning rates.
All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.
The authors have a dataset of 780 electronic health records and use it to detect various medical events, such as adverse drug events and drug dosages. The task is framed as assigning a label to each word in the document.
*(Table: annotation statistics for the corpus of health records.)*
They compare CRFs, LSTMs, and GRUs. Both LSTMs and GRUs outperform the CRF, but the best performance is achieved by a GRU trained on whole documents.
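As a concrete illustration of the word-level labeling setup, here is a toy example in the common BIO scheme. The sentence and the label names are invented for illustration; they are not taken from the paper's annotation schema.

```python
# Hypothetical sentence with per-word labels (B- begins an entity,
# I- continues it, O marks words outside any entity).
tokens = ["Patient", "developed", "rash", "after", "10", "mg", "warfarin"]
labels = ["O", "O", "B-ADE", "O", "B-DOSAGE", "I-DOSAGE", "B-DRUG"]

def spans(tokens, labels):
    """Group per-word BIO labels back into (entity_type, text) spans."""
    out, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                out.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)
        else:
            if current:
                out.append(current)
            current = None
    if current:
        out.append(current)
    return [(t, " ".join(ws)) for t, ws in out]
```

Framing event detection this way is what lets sequence models like CRFs, LSTMs, and GRUs all be applied to the same data: each simply predicts one label per word.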
The method is a multi-task learning model performing person detection, keypoint detection, person segmentation, and pose estimation. It is a bottom-up approach, as it first localizes identity-free semantics and then groups them into person instances.
- **Backbone**. The feature extractor is a ResNet (50 or 101) with one [Feature Pyramid Network](https://arxiv.org/pdf/1612.03144.pdf) (FPN) for the keypoint branch and one for the person detection branch. The FPN enhances the extracted features through a multi-level representation.
- **Keypoint detection** detects keypoints and also produces a pixel-level segmentation mask.
FPN features $K_i$ are processed with multiple $3\times3$ convolutions followed by concatenation and a final $1\times1$ convolution to obtain predictions for each keypoint as well as the segmentation mask (see the paper's figure for details). This results in (number of keypoints per person in the dataset) + 1 output layers. Additionally, intermediate supervision (i.e., an auxiliary loss) is applied at the FPN outputs. An $L_2$ loss between the predictions and Gaussian peaks at the ground-truth keypoint locations is used; similarly, an $L_2$ loss is applied between the segmentation predictions and the corresponding ground-truth masks.
- **Person detection** is essentially a [RetinaNet](https://arxiv.org/pdf/1708.02002.pdf), a one-stage object detector, modified to only handle *person* class.
- **Pose estimation**. Given the initial keypoint predictions, the Pose Residual Network (PRN) selects, for each keypoint class, the single keypoint belonging to the person instance.
During inference, the PRN takes cropped outputs from the keypoint detection branch, defined by the predicted bounding boxes from the person detection branch, resizes them to a fixed size, and forwards them through a multilayer perceptron with a residual connection. During training, the same process is performed, except that the cropped keypoints come from the ground-truth annotations defined by labeled bounding boxes.
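Two pieces of the pipeline above can be sketched in numpy: the Gaussian-peak targets used for the $L_2$ keypoint loss, and the crop-and-resize step that prepares the PRN's input. The sigma, the output grid size, and the nearest-neighbor resize here are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def keypoint_target(h, w, cx, cy, sigma=2.0):
    # Ground-truth heatmap: a Gaussian peak centered on the keypoint,
    # regressed with an L2 loss against the predicted heatmap.
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def crop_and_resize(heatmaps, box, out_size):
    # Crop per-keypoint heatmaps to a detected person box and resize to a
    # fixed grid before feeding the PRN (nearest-neighbor for simplicity).
    # heatmaps: (K, H, W); box: (x0, y0, x1, y1) in pixel coordinates.
    x0, y0, x1, y1 = box
    crop = heatmaps[:, y0:y1, x0:x1]
    _, h, w = crop.shape
    oy, ox = out_size
    rows = np.arange(oy) * h // oy
    cols = np.arange(ox) * w // ox
    return crop[:, rows][:, :, cols]

# Toy usage: one keypoint target, and one crop of 17 heatmaps to a box.
t = keypoint_target(16, 16, cx=5, cy=8)
hm = np.zeros((17, 64, 64))
hm[0, 20, 20] = 1.0
out = crop_and_resize(hm, (10, 10, 42, 42), out_size=(16, 16))
```

At training time the crop would come from a labeled box and at inference time from the person detection branch, but the transformation itself is the same in both cases.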
This is not an end-to-end trainable model: while the keypoint and person detection branches can, in theory, be trained simultaneously, the PRN requires separate training.
**Personal note**. Interestingly, PRN training with ground-truth inputs (i.e., "perfect" inputs) only reaches an 89.4 mAP validation score, which is surprisingly far from the maximum possible score. This presumably means that even if the preceding networks or branches performed perfectly, the PRN could become a performance bottleneck. Therefore, more effort should be directed at the PRN itself. Moreover, modifying the network to support end-to-end training might help boost performance.
Open-source implementations used to verify my understanding of the paper: [link1](https://github.com/LiMeng95/MultiPoseNet.pytorch), [link2](https://github.com/IcewineChen/pytorch-MultiPoseNet).